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EXECUTIVE  SUMMARY 


PROGRAM  PLAN  FOR  DEVELOPMENT  OF  FAULT-TOLERANT 
SYSTEM  TECHNOLOGY 


FAULTS  AND  TOLERANCE 

A  fault  is  an  abnormal  condition  the  appearance  of  which  may  force  a  system  into  an 
error  state  leading  to  partial  or  complete  failure  to  execute  its  required  function.  Tolerance 
is  that  property  which  allows  the  system  to  perform  in  the  expected  way  despite  the  appear¬ 
ance  of  explicitly  specified  classes  of  potentially  disabling  faults. 


VALUE  TO  THE  SYSTEM 

Fault  tolerance  is  related  to  both  reliability  and  maintainability.  One  of  the  projects 
within  this  program  plan,  in  fact,  is  concerned  precisely  with  defining  reliability  and  maintain¬ 
ability  to  take  into  consideration  degrees  of  tolerance  and  other  factors  not  presently  con¬ 
sidered  so  that  fault  tolerance  technology  may  be  more  effectively  applied. 

With  reliability  and  maintainability  fault  tolerance  significantly  affects  system  effec¬ 
tiveness,  logistic  supportability,  and  life-cycle  cost. 

Unavailability  at  a  time  of  critical  need  could  result  in  loss  of  men.  material  and  tac¬ 
tical  advantage.  Effectiveness  is  totally  dependent  upon  availability  and  availability  is  directly 
proportional  to  the  fault  tolerance  of  the  system. 

Supportability  requires  a  match  between  system  maintenance  needs  and  available 
maintenance  resources.  Continuing  growth  of  system  complexity  is  tolerable  only  if  the 
accompanying  maintenance  requirements  are  prevented  somehow  from  rising  through  the 
ceiling.  Built-in  test  and  other  fault  tolerance  techniques  must  be  employed  to  keep  this  from 
happening.  With  BIT,  one  internal  failure  can  be  detected  and  corrected  at  a  reasonable  ex¬ 
penditure  of  energy  and  time  before  another  follows  and  the  system  is  out  of  combat 
altogether. 

System  life-cycle  costs  range  from  3  to  20  times  procurement  costs,  and  maintenance 
costs  make  up  a  large  slice  of  the  pie.  Test  equipment,  training,  and  documentation  contrib¬ 
ute  heavily  to  maintenance  costs,  but  the  measure  of  each  of  them  can  be  held  down  through 
fault  tolerance. 

So  fault  tolerance  contributes  to  military  systems  in  the  areas  of  effectiveness,  sup¬ 
portability.  and  life-cycle  cost.  What  value  can  be  put  upon  it?  This  must  be  measured 
against  the  importance  of  the  system  to  the  mission  Fault  tolerance  may  be  an  unnecessary 
refinement  in  one  application,  merely  desirable  in  a  second,  and  essential  in  the  third. 

In  a  mission-critical  system  such  as  V/STOL  (vertical/short  take  off  and  landing  air¬ 
craft)  for  example,  it  must  be  considered  essential  on  the  basis  of  the  following  and  other 
contributions: 

It  minimizes  risks  to  users  and  equipment  associated  with  component  and  device 

failure. 

It  eliminates  unacceptable  maintenance  downtime  in  real-time  environments. 

It  allows  systems  to  be  used  in  environments  to  which  access  is  restricted. 

It  opens  the  possibility  of  lower  total  life-cycle  costs  and  of  lower  initial  costs  for  a 
given  reliability  goal. 

It  provides  psychological  support  to  system  users. 
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ROADBLOCKS 


A  program  tor  the  development  and  implementation  of  a  technology  promising  all 
these  advantages  should  not  be  difficult  to  promote.  It  would  be  well,  nevertheless,  to  look 
closely  at  the  state  of  the  art  today  before  we  plot  our  route  to  the  mature  and  usable  tech¬ 
nology  of  the  future.  We  find  it  is  characterized  by  a  number  of  lacks,  by  fragmentation,  by 
inertia,  and  by  resistance. 


LACK  OF  CONTINUITY.  Some  fault  tolerance  techniques  were  developed  for  first- 
generation  computers  but  were  discarded  because  the  second-generation  computer  demon¬ 
strated  much  higher  reliability  in  semiconductor  and  magnetic-core  components.  In  addition, 
many  problem  solutions  were  not  openly  documented,  causing  much  reinvention  of  tech¬ 
niques.  Also,  many  errors  were  repeated  with  subsequent  loss  of  confidence  in  those 
techniques. 

LACK  OF  COST/BENEFIT  MEASURES.  To  date  there  are  no  general  methods  for 
a  convenient  quantitative  assessment  of  the  benefits  of  fault  tolerance  in  terms  of  life-cycle 
cost  reduction.  The  initial  extra  cost  due  to  the  use  of  redundancy  techniques  is  more 
directly  evident  and  tends  to  bias  in  favor  of  systems  without  fault  tolerance  a  large  class  of 
users  who  do  not  have  an  absolute  requirement. 


LACK  OF  SPECIFICATIONS  AND  ACCEPTANCE  TESTS.  The  user  community  at 
large  does  not  have  adequate  knowledge  of  the  properties  and  limitations  of  fault  tolerance. 

As  a  consequence,  specifications  for  reliability  are  insufficiently  precise  and  virtually  unveri- 
fiahle  in  advance  of  system  use.  For  example,  reliability  requirements  for  a  given  time  inter¬ 
val  do  not  specify  classes  of  faults  and  do  not  define  what  constitutes  acceptable  recovery. 

FRAGMENTATION  OF  EFFORT.  Program  efforts  to  increase  reliability  of  systems 
originate  within  several  disciplines  of  engineering  theory  and  practice.  These  include  system 
architecture,  software  engineering,  testing  and  design  verification,  design  of  data  base  manage¬ 
ment  systems,  computer  networks  and  communication  systems,  component  and  packaging 
engineering,  and  field  operation  and  maintenance.  Although  they  have  a  common  goal,  these 
efforts  have  remained  largely  disjointed.  A  lack  of  common  viewpoint  and  systematic  com¬ 
munication  is  evident.  There  is  also  a  gap  between  the  results  of  theoretical  investigations 
and  practical  engineering  solutions  to  fault  tolerance  problems. 

INERTIA  IN  THE  DESIGN  PROCESS.  Introduction  of  fault  tolerance  in  system  de¬ 
sign  requires  an  early  commitment  and  a  significant  departure  from  the  traditional  evolution¬ 
ary  design  of  system  or  subsystem  "product  line”  in  which  compatibility  of  software  is  usually 
a  dominant  factor.  While  the  number  of  fault  tolerance  techniques  at  hand  to  serve  as  main¬ 
tenance  aids  has  been  increasing,  none  of  the  major  manufacturers  have  announced  a  fully 
lault-tolerant  line  of  computers  or  subsystems.  The  only  fault-tolerant  systems  actually  de¬ 
livered  have  been  custom  made  to  special  requirements. 


PROGRAM  PLAN  -  THREE  OBJECTIVES  AND  32  PROJECTS 

The  Naval  Ocean  Systems  Center  (NOSC)  with  Air  Force  sponsorship  has  developed  a 
Fault  Tolerance  Program  Plan  which  represents  a  systematic  assault  on  the  roadblocks  de¬ 
scribed  above.  It  has  three  prime  objectives  and  is  structured  in  32  individual  projects.  The 
objectives  and  the  projects  associated  with  them  are  listed  below.  Several  projects  apply  to 
two  objectives,  and  one  applies  to  three.  Top-priority  projects  are  asterisked. 

1.  Clarify  operational  requirements  and  relate  them  to  system  specifications. 

Fault  Tolerance  Requirements  Definition  and  Interpretation* 

Mission  Availability  Analysis  Methods* 

Project  DAIS  Fault  Tolerance  Evaluation* 

Relation  of  Requirements  to  Specification* 

Command  Control  Fault  Tolerance* 

Standards  and  Fault  Tolerance* 

Self-Diagnosing  Design  Techniques* 

Design/Development  Tools  (Methodology)* 

TECHEVAL-OPEVAL  Techniques* 

Fault  Tolerance  Design  Handbook 
Specification  for  Redundancy  Management 
Architectures  for  Availability  Requirements 


2.  Establish  alternate  system  design  method  for  balancing  operational  capabilities 
against  life-cycle  costs. 

Self-Diagnosing  Design  Techniques* 

Design/Development  Tools  (Methodology)* 

Acceptance  Testing* 


Establishment  of  Redundancy  Limits/Tradeoffs 

Failure-Fault  Prediction  Technology 

Recovery  Techniques 

Transient  Faults 

Identification  of  Failure  Modes 

Fault-Tolerant  Software 

Analog  Functional  Redundancy 

Estimation  of  Confidence  Limits  Testing  Large  Logic  Networks 
Redundant  Microcomputers 

Loosely  Coupled  Fault-Tolerant  Computer  Networks 
Fault  Tolerance  Masking  Hazards 
Fault  Tolerance  Design  Handbook 
Reliability/Fault  Tolerance  Analysis  and  Design  Tools 


3.  Develop  understanding  of  system  engineering  relationships  between  reliability, 
fault  tolerance,  repairability,  and  logistics  support. 

Fault  Tolerance  Life-Cycle  Cost  Impacts  -  Maintenance  Model* 

Theory  of  Testing  Taxonomy* 

Reliability  Measure  of  Testability  Concept* 

High-Order  Language  Constructs  for  Fault-Tolerant  Systems* 

Functional  Test  Design  Theory* 

Fault  Tolerance  Validation  and  Verification* 

Acceptance  Testing* 

Standards  and  Fault  Tolerance* 

TECHEVAL-OPEVAL  Techniques* 

Communication  Protocol  for  Fault  Tolerance 
Reliability/Fault  Tolerance  Analysis  and  Design  Tools 
Alternate  System  Design  Evaluation 
Fault  Tolerance  Masking  Hazards 
Fault  Tolerance  Design  Handbook 

You  will  have  noticed  that  the  Fault  Tolerance  Design  Handbook  is  the  project  com¬ 
mon  to  all  three  objectives. 

IMPLEMENTATION  STRATEGY 

Development  of  the  fault  tolerance  technology  by  the  military  must  be  in  consonance 
with  the  efforts  of  other  governmental,  industrial,  and  university  organizations.  It  must  ex¬ 
ploit  the  results  of  related  research  supported  by  National  Science  Foundation.  National 
Aeronautics  and  Space  Agency,  and  other  organizations. 

Cost  and  technical  risk  will  not  allow  development  of  fault  tolerance  technology  for  a 
single  specific  system.  The  technology  must  be  applicable  to  all  systems.  Basic  ideas  must  be 
funded  first .  Initial  targets  tor  application  must  be  as  broad  as  electronics/avionics,  communi¬ 
cations.  surveillance,  and  command  control.  Later  effort  can  be  funneled  into  such  specific- 
fields  as  computer  netting,  human  error  in  systems,  data  base  fault  tolerance,  fault-tolerant 
hardware,  and  fault/failure  prediction. 

The  military  must  sponsor  technology  transfer  between  DoD  and  civilian  communi¬ 
ties.  Application-oriented  workshops  are  suggested  as  one  effective  medium.  It  is  also  strongly 
recommended  that  the  military  supply  the  needed  fault  tolerance  literature  developing, 
maintaining,  and  distributing  a  series  of  publications  that  reflect  the  baseline  of  the 
technology. 
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1 .0  The  Fault  Tolerance  Problem 


1 . 1  Purpose  of  the  plan.  Military  goals  and  strategies  strongly  emphasize  increased  system 
availability,  simplified  maintenance,  and  reduced  life-cycle  costs.  Yet,  the  technical  areas  of 
reliability,  fault  tolerance,  and  maintainability  (which  are  paramount  to  mission  success  in 
tactical  systems)  suffer  from  several  critical  problems.  Solutions  to  these  problems  are  the 
objective  of  this  research  and  development  program  plan.  The  basic  objectives  of  the  program 
plan  are : 

a.  To  clarify  operational  requirements  and  to  relate  them  to  statement  of  system 
specifications 

b.  To  establish  a  method  for  evaluating  alternate  system  designs  that  balance  opera¬ 
tional  capability  against  life-cycle  costs,  especially  for  degraded  modes  of  operation 

c.  To  develop  understanding  of  system  engineering  relationships  between  reliability, 
fault  tolerance,  repairability,  and  support 

Some  specific  problems  related  to  the  rapid  advance  in  technology  and  the  increasing 
complexity  of  systems  are: 

a.  Disparity  in  evolution  of  fault-tolerant  design  and  LSI  technology 

b.  Lack  of  system  reliability  and  performance  estimation  tools  for  risk  analysis,  de¬ 
sign.  and  system  verification 

c.  Lack  of  awareness  on  the  part  of  designers  of  the  various  techniques  for  designing 
fault-tolerant  systems 

d.  Lack  of  evaluation  methods  for  proving  the  effectiveness  of  fault  tolerance 

Modern  design  of  a  system  of  the  future  requires  the  understanding  of  systems  opera¬ 
tion,  reliability  design,  fault  tolerance  partitioning,  maintenance  design  (including  the  opera¬ 
tion  and  design  of  equally  complex  test  equipment),  and  other  aspects  of  the  overall  design 
process.  This  program  plan  is  based  on  addressing  these  needs  from  an  integrated  viewpoint, 
beginning  with  operational  requirements  and  ending  with  final  system  acceptance  testing  and 
evaluation.  This  plan  is  written  to  achieve  the  development  of  fault-tolerant  systems  technol¬ 
ogy  through  a  concentrated,  prioritized,  and  phased  program. 

1 .2  Definition  of  fault  tolerance.  A  fault  is  an  abnormal  condition  that  appears  during 
the  operation  of  a  system.  Its  appearance  may  or  may  not  cause  a  departure  from  the  ex¬ 
pected  behavior  and  force  the  system  into  an  undesirable  (error)  state  or  sequence  of  error 
status.  The  arrival  at  an  error  state,  in  turn,  leads  to  a  partial  or  complete  failure  of  the  sys¬ 
tem  to  execute  the  required  function  unless  provisions  exist  to  cause  a  return  to  the  expected 
behavior.  Causes  of  faults  are  either  adverse  physical  phenomena  that  can  be  a  temporary  or 
a  permanent  failure,  an  external  interface,  or  human  error.  Because  of  their  disruptive  effect 
on  system  operation,  the  avoidance  and/or  tolerance  of  faults  are  system  problems  involving 
and  related  to  the  design,  analysis,  management,  maintenance,  and  use  of  systems. 

Fault  tolerance  is  that  property  of  a  system  that  allows  the  system  to  perform  in  an 
exp  -cted  way  regardless  of  the  appearance  of  certain  (explicitly  specified)  classes  of  faults 
that  would  otherwise  force  the  system  into  an  error  state. 

Fault  tolerance  is  related  to  both  reliability  and  maintainability.  The  ease  and  rapidity 
with  which  a  failed  subsystem  or  equipment  can  be  restored  to  operational  status  after  a  fail¬ 
ure  directly  affects  the  fault  tolerance  of  the  system  For  example,  if  there  is  single 


redundancy  of  a  particular  subsystem  and  there  is  one  failure,  the  system  then  becomes  fault- 
intolerant  until  the  repair  is  completed.  This  fault  intolerance  period  is  a  characteristic  of 
design  and  installation  which  is  expressed  as  a  probability  that  an  item  will  be  retained  in.  or 
restored  to,  a  specified  condition  within  a  given  period  of  time  when  prescribed  recovery 
techniques  are  performed. 

This  fault  intolerance  measure,  shown  in  figure  1 .  can  be  expressed  either  as  a  measure 
of  the  time  (T)  required  to  recover  from  a  given  percentage  of  all  failures,  or  as  a  probability 
(F)  of  restoring  the  system  to  operational  status  within  a  period  of  time  following  a  fault  or 
failure. 

1.3  Fault  tolerance-maintainability  relationship.  There  is  an  important  relationship  be¬ 
tween  fault  tolerance  and  system  maintainability.  Both  deal  with  the  properties  of  the  entire 
system  under  the  appearance  of  classes  of  faults  and/or  failures.  Both  areas  are  concerned 
with  detection  and  location  of  faults/failures.  After  detection  and  location,  fault  tolerance  is 
concerned  with  recovery  procedures,  while  maintainability  is  concerned  with  re pairability . 

In  fact,  the  relationship  is  so  close  that  fault  tolerance  should  necessarily  be  considered 
a  part  of  maintainability  or  vice  versa.  Additionally,  this  view  simplifies  and  strengthens  the 
designer's  concept  of  fault  tolerance  and  ultimately  will  produce  systems  that  are  more  effec¬ 
tive.  This  is  true  even  tor  systems  such  as  satellite  electronics  which  cannot  conveniently  be 
repaired;  however,  the  functional  partitioning  and  structure  of  hardware  are  improved  by 
mechanical  design.  Hierarchical  design  into  functional  and  mechanical  modules  and  compo¬ 
nents  is  critical  to  the  entire  design  process. 

It  should  he  noted  that  fault  tolerance  anil  maintainability  are  not  necessarily  always 
correlated  in  a  positive  sense.  Redundancy  for  fault  tolerance  can  have  the  effect  of  masking 
faults  and  of  compounding  the  fault-detection  problem,  (  are  must  be  taken  during  design  to 
ensure  that  faults  are  not  masked  from  the  maintenance  man,  else  overall  system  reliability 
will  suffer. 

1  4  Importance  ot  fault  tolerance.  The  basic  value  of  a  system  is  determined  by  three 
fundamental  factors:  system  effectiveness,  logistic  supportability.  and  life-cycle  costs.  All 
three  are  dependent  on  the  reliability,  fault  tolerance,  and  maintainability  characteristics  of 
the  system,  and  all  should  be  an  important  consideration  in  planning  acquisition  of  a  new 
system. 
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a.  System  effectiveness.  Reliability,  performance  capability,  and  availability  are 
primary  measures  of  system  effectiveness.  Availability  and  its  several  deviations  (eg,  turna¬ 
round  time,  operational  availability,  ready  rate,  etc)  are  directly  proportional  to  the  fault- 
tolerant  characteristics  of  the  system.  A  fault-tolerant  system  is  more  often  operable  at  the 
instant  it  is  needed.  Also,  it  can  be  quickly  maintainable  because  it  has  built-in/designed-in 
fault  detection  and  location  capability.  A  fault-tolerant  system  is  an  important  consideration 
when  it  is  recognized  that  unavailability  of  a  system  at  a  time  of  critical  need  could  result  in 
loss  of  men,  material,  and  tactical  advantage. 

Unless  special  care  is  taken  in  fault-tolerant  system  design,  built-in  redundancy  can 
mask  faults  during  the  early  portion  of  system  life.  As  redundancy  is  exhausted,  the  proba¬ 
bility  of  failure  is  increased.  It  is  critical  that  fault-tolerant  design  include  indication  of  faults 
as  they  occur  so  that  repairs  can  be  made  immediately. 

b.  Logistic  supportability.  Maintenance  requires  skilled  personnel  in  quantities  and 
skill  levels  commensurate  with  the  complexity  of  the  system.  However,  the  balance  of  that 
relationship  cannot  be  maintained.  System  complexity  has  increased  to  the  point  at  which 
the  skill  level  and  quantities  of  maintenance  personnel  for  system  maintenance  cannot  be 
achieved.  A  system  should  be  designed  to  be  fault-tolerant  to  ensure  it  will  continue  to 
operate  even  though  there  has  been  an  internal  failure.  Built-in-test  should  be  designed  so 
that  the  system  can  be  quickly  restored  to  service  by  available  maintenance  and  operating 
personnel  before  another  failure  occurs  and  causes  a  total  failure  of  the  system.  This  is 
particularly  important  since  the  typical  technician  may  be  20  years  old  with  approximately 
40  weeks  of  formal  schooling  and  usually  with  no  more  than  2  years  of  experience. 

c.  Life-cycle  costs.  Critical  to  life-cycle  costs  of  any  system  are  the  labor-intensive 
operations  requiring  training  and  expensive  equipment.  Maintenance  is  a  costly  operation 
and  can  be  significantly  reduced  by  fault  tolerance  and  maintenance  design.  Significant  sav¬ 
ings  in  system  costs  involving  test  equipment,  training,  and  documentation  represent  only  a 
portion  of  the  savings  in  maintenance  time  and  down-time  that  can  be  achieved.  Often  sys¬ 
tem  life-cycle  costs  range  from  3  to  20  times  their  original  procurement  cost. 

The  initial  procurement  cost  must  be  a  constraint  on  system  design.  There  is  al¬ 
ways  a  budget  available  for  systems  purchase;  but  if  the  improvement  in  life-cycle  cost  causes 
the  procurement  cost  to  significantly  increase,  the  improvement  may  not  be  affordable  simply 
because  we  do  not  have  the  funds  available  for  that  initial  purchase. 

1 .5  Potential  of  fault  tolerance.  The  acceptance  and  general  use  of  full  fault  tolerance 
(without  manual  intervention)  can  meet  several  military  requirements: 

a.  Minimize  the  risks  associated  with  component  and  device  failures,  in  which  the 
failures  either  endanger  human  lives  or  threaten  catastrophic  loss  to  the  users.  Examples  are 
systems  for  air  traffic  control,  guidance  and  control  of  vehicles  and  equipment,  and  other 
surveillance,  weapons  control,  and  navigation  sensors. 

b.  Provide  reliable  ( fault  tolerant)  systems  for  environments  that  do  not  allow  access 
for  manual  maintenance,  such  as  space,  underseas,  and  other  locations  (eg,  PHM,  aircraft), 
where  access  is  either  impossible  or  excessively  costly. 

c.  Enable  almost  uninterrupted  operation  of  real-time  systems  in  areas  in  which 
manual  intervention  creates  unacceptable  delays. 

d.  Provide  the  possibility  of  lower  initial  costs  (for  a  given  reliability  goal)  than  a  sys¬ 
tem  that  depends  on  fault  avoidance  (intolerance).  This  can  occur  in  cases  in  which  fault 
tolerance  allows  use  of  less  costlv  components  or  reduces  the  cost  of  designed-in  fault  elimi¬ 
nation  before  the  system  is  delivered. 


e.  Provide  the  possibility  of  reduced  life-cycle  costs  over  systems  with  manual  main¬ 
tenance  requirements.  Fault  tolerance  design  can  reduce  maintenance  to  off-line  replacement 
of  disconnected  elements,  eliminating  costs  associated  with  unavailability  of  systems  between 
the  failure  and  completion  of  repairs. 

f.  Provide  psychological  support  to  system  users  through  knowledge  that  fault  toler¬ 
ance  is  incorporated  in  the  system  on  which  they  depend  for  safety  or  other  benefits. 

The  full  potential  of  fault  tolerance  has  never  been  realized.  This  program  focuses  on 
demonstrating  this  potential. 

1.6  Current  roadblocks.  There  are  several  obstacles  to  cost-effective  application  of  fault 
tolerance  in  military  systems: 

a.  Lack  of  continuity.  Some  fault  tolerance  techniques  were  developed  for  first- 
generation  computers  but  were  discarded  because  the  second-generation  computer  demon¬ 
strated  a  much  higher  reliability  in  semiconductor  and  magnetic-core  components.  In  addi¬ 
tion.  many  solutions  were  not  openly  documented,  causing  much  reinvention  of  techniques. 
Also,  many  of  the  errors  were  repeated  with  subsequent  loss  of  confidence  in  those 
techniques. 

b.  Lack  of  cost/benefit  measures.  Thus  far  there  are  no  general  methods  for  a  con¬ 
venient  quantitative  assessment  of  the  benefits  (in  terms  of  life  cycle  cost  reduction)  of  fault 
tolerance.  The  initial  extra  costs  due  to  the  various  redundancy  techniques  is  more  directly 
evident  and  tends  to  bias  a  large  class  of  users  (who  do  not  have  an  absolute  requirement)  in 
favor  of  systems  without  fault  tolerance. 

c.  Lack  of  specifications  and  acceptance  tests.  The  user  community  at  large  does  not 
have  adequate  knowledge  of  the  properties  and  limitations  of  fault  tolerance.  As  a  conse¬ 
quence.  specifications  for  reliability  are  insufficiently  precise  and  virtually  unverifiable  in  ad¬ 
vance  of  system  use.  For  example,  reliability  requirements  for  a  given  time  interval  do  not 
specify  classes  of  faults  and  do  not  define  what  constitutes  acceptable  recovery.  Also,  the 
MTBF  specifications  do  not  explicitly  deal  with  fault  classes  (eg,  transients  and  design  faults) 
and  recovery  requirements,  and  ignore  the  differences  between  redundant  and  nonredundant 
designs.  Extremely  high  reliability  and  MTBF  predictions  are  sometimes  offered  without 
stating  the  implicit  assumptions  of  a  static  reliability  model  and  a  very  limited  class  of  faults. 
In  contrast,  consider  speed  requirements  in  instructions  per  second,  which  can  be  stated  and 
tested  for  acceptance  very  precisely. 

d.  Fragmentation  of  efforts.  Program  efforts  to  increase  reliability  of  systems  origi¬ 
nate  within  several  disciplines  of  engineering  theory  and  practice.  These  include  system 
architecture,  software  engineering,  testing  and  design  verification,  design  of  data  base  manage 
ment  systems,  computer  networks  and  communication  systems,  component  and  packaging 
engineering,  field  operation  and  maintenance,  and  others.  Although  they  all  have  a  common 
goal,  these  efforts  have  remained  largely  disjointed.  A  definite  lack  of  common  viewpoint 
and  of  systematic  communication  is  evident  at  the  present  lime.  There  is  also  a  gap  between 
the  results  of  theoretical  investigations  and  practical  engineering  solutions  to  fault  tolerance 
problems. 

e.  Inertia  in  the  design  process.  Introduction  of  fault  tolerance  in  system  design  re¬ 
quires  an  early  commitment  and  a  significant  departure  from  the  traditional  evolutionary 
design  of  system  or  subsystem  "product  line"  in  which  compatibility  ot  software  is  usually  a 
dominant  factor.  While  the  number  of  fault  tolerance  techniques  to  serve  as  maintenance  aid 
has  been  increasing,  none  of  the  major  manufacturers  has  announced  a  fulls  fault-tolerant 
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line  of  computers  or  subsystems.  The  only  fault-tolerant  systems  that  have  been  actually  ! 

delivered  were  custom-made  to  special  requirements.  The  military  must  provide  the  impetus  \ 

to  foster  wide  use  of  fault  tolerance.  j 

f.  Resistance  to  potential  impact.  Successful  introduction  of  fault  tolerance  may  ' 

cause  some  de-emphasis  of  several  currently  profit-making  activities.  Development  of  ultra-  i 

reliabile  devices,  maintenance  and  operations  personnel  support,  new  test  equipment  develop-  j 

ment,  and  activities  associated  with  the  a  priori  verification  of  software  are  examples  of  these  ! 

activities.  i 

I 

In  conclusion,  while  most  of  the  above  difficulties  are  common  to  many  disciplines  in 
systems  engineering  and  science,  they  reach  their  greatest  severity  in  the  science  and  imple¬ 
mentation  of  fault  tolerance. 

i 

i 

1.7  Systems  approaches  critical.  On  the  basis  of  the  discussion  of  obstacles,  it  is  clearly  j 

imperative  that  fault  tolerance  be  designed,  evaluated,  and  managed  from  a  total  systems  j 

viewpoint.  All  current  and  proposed  efforts  must  be  identified,  analyzed,  and  integrated  to 

achieve  the  optimum  balance  in  an  R&D  program  plan  as  well  as  in  those  applications  involv¬ 
ing  fundamental  tradeoffs  between  design  requirements,  life-cycle  costs,  and  maintainability. 

Appendix  B  is  a  summary  of  current  and  proposed  fault-tolerant  systems  research  and  studies.  i 

All  decisions  in  specifications  and  design  must  be  carefully  analyzed  to  determine  detrimental  j 

effects  on  other  system  effectiveness  parameters.  It  is  the  intent  of  this  program  plan  to 
develop  understanding  of  all  critical  system  design  parameters  and  their  relationship  to  one 
another.  It  is  apparent  that  the  simplest  and  most  effective  program  will  result  from  a 

systems-oriented  approach.  j 

1.8  R&D  strategy.  Development  of  fault-tolerant  technology  by  the  military  must  be  in 
consonance  with  the  efforts  of  other  government,  industrial,  and  university  organizations.  It 
is  recognized  that  in  a  period  of  austere  budgeting  for  R&D,  the  military  must  plan  for  effec¬ 
tive  use  of  its  budget  to  obtain  the  maximum  return.  Such  a  planning  strategy  is  summarized 
below: 

a.  The  primary  objective  of  this  program  plan  is  the  development  of  fault  tolerance  ^ 

technology  that  can  be  applied  to  all  systems.  Although  it  is  feasible  to  support  the  develop¬ 
ment  of  a  fault-tolerant  capability  in  a  specific  system,  generally  cost,  technical  risk,  and 

other  specific  concerns  of  technology  will  not  allow  specific  development  of  fault-tolerant 
technology  in  a  system.  Unlike  NASA,  which  can  spend  significant  resources  to  achieve  its 
object  in  a  specific  system,  the  military  must  focus  on  developing  technology  to  make  it  avail¬ 
able  for  a  multitude  of  different  systems.  This  can  be  achieved  through  planning  that  focuses 
on  specific  areas  of  technology  while  exploiting  other  research  supported  by  National  Science 
Foundation  (NSF),  National  Aeronautics  and  Space  Agency  (NASA),  and  other  governmental 
and  industrial  organizations.  The  objective  of  this  plan  should  be  the  funding  (seeding)  of 
basic  ideas  which  can  lead  to  larger  projects.  It  should  foster  the  overall  technology  but  use 
major  funding  in  selected  applications. 

b.  Selected  military  applications  which  must  he  addressed  to  meet  military  require¬ 
ments  include  electronics/avionics,  communications,  surveillance,  command  control,  and  then 
more  specific  areas.  Among  the  specifics  include  computer  netting,  human  error  in  systems, 
data  base  fault  tolerance,  fault-tolerant  software,  and  fault/failure  prediction  in  all  systems. 

These  areas  must  be  the  focus  of  attention  in  all  demonstrations  of  capabilities.  In  a  limited 
budget,  the  need  exists  to  keep  the  research  directed  toward  critical  application  areas.  It  is 
also  necessary  to  maintain  a  management  and  communication  approach  that  will  foster  tech¬ 
nology  transfer. 
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c.  Technology  transfer  involves  several  aspects  of  communications  between  the  tech¬ 
nical  and  management  communities.  First,  the  military  must  provide  a  means  of  exchanging 
information  that  will  advance  the  technology.  One  approach  is  the  use  of  a  series  of  technol¬ 
ogy  workshops  that  are  application  (rather  than  uniquely  technology)  oriented.  Also,  the 
military  can  develop,  maintain,  and  distribute  a  series  of  publications  (eg.  a  handbook  on 
built-in  test)  that  reflect  the  baseline  of  the  technology.  Such  a  series  of  state-of-the-art 
publications  has  been  conspicuously  lacking  in  this  area. 

In  summary,  the  R&D  program  in  fault  tolerance  requires  a  serious  commitment  from 
the  military.  It  involves  recognition  of  the  need  for  expertise  both  within  and  outside  the 
military  community  to  provide  continuity  in  the  technology. 

1 .9  Specific  Requirements 

1 .9. 1  Air  Force  fault  tolerance  requirements.  The  US  Air  Force  has  two  different  sets  of 
requirements  for  fault-tolerant  systems,  depending  on  the  availability  of  maintenance  for  the 
system.  One  set  of  requirements  is  represented  by  manned  aircraft  and  the  other  by  un¬ 
manned  spacecraft.  Typical  of  the  first  set  of  requirements  is  the  tactical  aircraft  of  the 
1980s.  This  aircraft  should  be  able  to  operate  out  of  a  bare  base  for  a  30-day  period  with  the 
following  capabilities: 

•  Multiple  (typically  six)  sorties/day 

•  No  delay  in  turnaround  time  due  to  maintenance  actions 

•  No  aborts  due  to  failure  of  electronics 

•  Degraded  performance  after  failures  not  less  than  50'?  of  full-up  performance  in 
terms  of  mission  effectiveness 

•  All  faults  removed  in  a  single  maintenance  period  per  day.  usually  not  more  than  6 
hours 

•  No  tlight-line  test  equipment 

•  Maintenance  performed  with  minimum-skill-level  personnel 

There  will  be  deviations  in  requirements  from  those  listed  above,  depending  on 
whether  the  design  is  for  a  tactical  fighter  aircraft,  a  strategic  bomber,  an  airlift  aircraft,  or 
even  a  piece  of  ground  equipment,  such  as  a  radar.  If  the  requirements  enumerated  above  for 
the  tactical  fighter  can  be  solved,  significant  progress  will  have  been  made  toward  solving  the 
problems  of  reliability  and  maintenance  for  other  manned  systems.  Therefore,  a  reasonable 
approach  to  solving  these  problems  is  to  study  the  tactical  fighter  aircraft  and  translate  the 
solution  for  this  aircraft  to  other  manned  systems.  For  any  of  these  aircraft  the  designer 
must  consider  an  optimum  mix  of  fault  tolerance  for  a  short  period  of  time,  and  quick  fault 
removal  when  maintenance  can  be  performed. 

Requirements  for  unmanned  spacecraft  are  significantly  different  from  those  for 
manned  aircraft.  In  the  case  of  spacecraft,  the  emphasis  is  on  highly  reliable  operation  over  a 
long  period  of  time.  Typical  parameters  are  95'?  probability  of  successful  operation  alter  5 
years  in  space.  In  this  case  toleration  of  faults  is  the  important,  or  driving,  aspect  with  little 
or  no  emphasis  on  fault  removal.  Redundancy  techniques  that  mask  failures  are  acceptable 
in  the  spacecraft  situation,  but  not  for  aircraft  where  detection,  isolation,  and  removal  ot 
faults  are  required  in  order  to  meet  the  long-term  operational  goals  of  the  system. 


1 .9.2  V/STOL  aircraft  requirements.  The  V/STOL  aircraft  represents  an  advanced  concept 
in  naval  weaponry,  and  its  successful  deployment  in  the  1990s  will  require  a  significant  ad¬ 
vance  in  new  technology.  Current  expectations  for  V/STOL  and  its  use  on  surface  effects 
ships,  DD963,  and  other  ships  not  designed  for  carrying  aircraft  will  dictate  significant 
changes  in  operations  and  design.  An  aircraft  which  must  be  supported  aboard  a  ship  other 
than  an  aircraft  carrier  requires  sound,  practical  logistics  support  concepts.  These  new  con¬ 
cepts  present  a  challenge  to  implement  new  fault-tolerant  designs. 

The  Navy  reliability  and  maintainability  requirements  for  V/STOL  are  stringent  and 
far  exceed  current  aircraft  capabilities.  They  involve  approximately  200%  improvement  in 
reliability  and  a  total  maintenance  time  not  to  exceed  10%  of  total  time,  an  exacting  objective 
of  total  maintenance  time  not  to  exceed  72  hours  of  each  month.  It  is  recognized  that  these 
requirements  can  only  be  met  by: 

•  Built-in-test  design  for  fault  detection,  isolation,  and  repair 

•  Extensive  use  of  redundancy  for  improving  reliability  and  increasing  the  interval 
between  both  scheduled  and  unscheduled  maintenance 

•  New  concepts  in  partitioning  for  more  improved  design  between  avionics  accessibility 
and  in-flight  maintenance  capability 

An  expanded  fault  tolerance  technology  program  is  clearly  a  Navy  need  for  the  suc¬ 
cessful  deployment  of  the  V/STOL  in  the  1990s. 

1 .9.3  Command  control  requirements.  The  continuous  and  survivable  operation  of  com¬ 
mand  control  systems  has  been  addressed  primarily  by  the  use  of  redundant  equipment  sup¬ 
ported  by  the  backup  human  element.  The  integration  of  man  and  machines  into  effective, 
highly  reliable,  and  highly  maintainable  systems  is  a  pressing  demand.  As  systems  become 
more  complex,  the  repair  (downtime)  may  increase  as  well  as  the  operational  cost.  In  a 
period  of  decreasing  budgets,  and  with  operations  and  maintenance  cost  increasing,  the  design 
of  reliable,  easily  maintainable  equipment  remains  a  critical  need. 

2.0  Fault  Tolerance  Program  Plan 

2. 1  System  design  and  acquisition  cycle.  The  ultimate  objective  in  the  acquisition  of  a 
system  or  equipment  is  to  achieve  maximum  utility  at  minimum  cost.  The  required  level  of 
reliability,  fault  tolerance,  and  maintainability  in  systems  delivered  to  the  users  should  be 
balanced  against  logistic  support  complexity  and  potential  life-cycle  costs.  Acquisition  of  a 
system  involves  a  series  of  phases.  During  each  phase  the  fault  tolerance  and  other  capabilities 
are  enhanced,  if  appropriate  knowledge,  technology,  and  methodology  are  carefully  applied. 
This  R&D  program  plan  is  based  on  supporting  the  design  and  acquisition  process  in  achieving 
fault  tolerance. 

The  acquisition  process  is  depicted  in  Figure  2  as  a  simple  phase-to-phase  flow  diagram. 
The  transition  points  denote  major  achievement  milestones  which,  if  attained,  signify  the  suc¬ 
cessful  completion  of  the  preceding  phases,  Each  phase  consists  of  several  engineering  and 
management  projects.  These  projects  are  composed  of  several  tasks,  and  are  listed  in  appendix 
A,  Project  Descriptions.  Table  1  establishes  the  correlation  between  the  program  plan  objec¬ 
tives  (paragraph  1.1)  and  the  projects  listed  in  appendix  A. 

2.2  Operational  requirements.  System  mission  and  operational  concepts  are  identified 
and  described  in  the  operational  requirements;  specifically,  the  purposes,  environments,  and 
operational  methods  of  the  warfare  and  support  areas  for  both  the  present  and  foreseeable 
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REQUIREMENTS  STATED  FOR  AVAILABILITY.  RELIABILITY,  FAULT 
TOLERANCE,  ANO  MAINTAINABILITY. 


ANALYSIS  OF  RISKS  ASSOCIATED  WITH  COST,  SCHEDULE,  AND 
TECHNICAL  OBJECTIVES  AND  PARAMETERS.  CRITICAL  AREAS  IDENTI¬ 
FIED  IN  TEST  REQUIREMENTS,  TECHNICAL  PERFORMANCE  MEASURE¬ 
MENT  PARAMETERS,  AND  CRITICAL  MILESTONES. 


VARIOUS  SYSTEM  ALTERNATIVES  ANALYZED  AND  EVALUATED  ON 
BASIS  OF  RISK  ANALYSIS.  SYSTEM  APPROACHES,  AVAILABILITY 
AND  RECOVERY  TECHNIQUES,  AND  CONTROL  PROCEDURES 
DETERMINED.  CONCEPTS  ESTABLISHED.  FEASIBILITY  CONFIRMED 
ANALYTICALLY. 


SYSTEM  PARAMETERS,  CAPABILITIES,  AND  OTHER  TECHNICAL 
OBJECTIVES  DESCRIBED  FOR  CONTRACTOR  DESIGN  AND  DOCUMEN¬ 
TATION  IN  SYSTEM  SPECIFICATIONS,  MILESTONES,  AND  TASK 
REQUIREMENTS  FOR  CONTRACT. 


DESIGN  APPROACH  AND  CONCEPT  OPTIMIZED  BY  TRADEOFF  AND 
CONFORMANCE  TO  SPECIFIED  REQUIREMENTS  VERIFIED 
ANALYTICALLY.  DESIGN  OF  SYSTEM  COMPLETED. 


CONFORMANCE  TO  SPECIFICATIONS,  RESULTING  IN  DEVELOPMENT 
MODELS  AND  TEST  AND  EVALUATION 


PRODUCTION  INITIATED.  CONCEPTS,  HARDWARE,  SOFTWARE, 
PROCEDURES  VERTIFIED  BY  OPERATIONAL  EVALUATION  IOPEVAL). 


CONTINUED  ASSESSMENT,  PROBLEMS  IDENTIFIED  FOR  CORRECTION 


Figure  2.  Phases  of  system  acquisition. 

flit  lire  are  identified.  This  includes  the  operational  needs  and  capabilities  in  a  form  for  guid¬ 
ing  both  research  and  development  over  a  long  period  of  time.  Key  parameters  should  he 
identified  and  performance  goals  and  cost-performance  listed  for  systems.  It  is  important  to 
describe  required  capabilities  by  priority  or  time  urgency,  and  criticality  of  failure. 

Specific  quantitative  requirements  for  fault  tolerance  are  usually  not  determined  from 
a  cursory  review  of  system  operational  requirements.  In  fact,  fault  tolerance  is  inextricably 
tied  to.  or  governed  by.  other  system  effectiveness  parameters.  To  specifically  define  fault 
tolerance  requirements,  it  is  necessary  to  have  a  clear-cut  set  of  definitions  that  can  be  used 
to  derive  fault  tolerance  and  can  ultimately  be  quantitatively  optimized  with  respect  to  other 
operational  factors.  Therefore,  it  is  necessary  to  develop  procedures,  requirements,  and  use 
conditions  for  achieving  the  following  objectives: 

a.  Establish  baseline  requirements  for  fault  tolerance  which  are  amenable  to  design 
interpretation  and  implementation.  There  should  be  a  set  ol  definitions,  terminology,  and 
other  supporting  concepts  provided  for  the  development  of  specifications  and  procurement 
documents. 


B 


Fault  Tolerance  Requirements  Definition  and  Interpretation 
Mission  Availability  Analysis  Methods 
Project  DAIS  Fault  Tolerance  Evaluation 
Relation  of  Requirements  to  Specification 
Reliability/ Fault  Tolerance  Analysis  and  Design  Tools 
Alternate  System  Design  Evaluation 

Fault  Tolerance  Life-Cycle  Cost  Impacts  Maintenance  Model 
Establishment  of  Redundancy  Limits/Tradeoffs 
Theory  of  Testing  Taxonomy 
Reliability -Measure  of  Testability  Concept 
Failure-Fault  Prediction  Technology 
Recovery  Techniques 
Transient  Faults 
Identification  of  Failure  Modes 
Fault-Tolerant  Software 

High-Order  Language  Constructs  For  Fault -Tolerant  Systems 
Analog  Functional  Redundancy 
Functional  Test  Design  Theory' 

Fault  Tolerance  Validation  and  Verification 
Estimation  of  Confidence  Limits  Testing  Large  Logic  Networks 
Command  Control  Fault  Tolerance 
Architectures  For  Availability  Requirements 
Communication  Protocol  For  Fault  Tolerance 
Specification  For  Redundancy  Management 
Standards  and  Fault  Tolerance 
Self-Diagnosing  Design  Techniques 
Redundant  Microcomputers 

Loosely  Coupled  Fault-Tolerant  Computer  Networks 
Fault  Tolerance  Masking  Hazards 
Fault  Tolerance  Design  Handbook 
Design/Development  Toof  (Methodology) 

Acceptance  Testing 
TECHEVAL-OPEVAl  Techniques 

Plan  Objective: 

A.  Clarify  operational  requirements  and  relate  them  to  system  specifications 

B.  Establish  alternate  system  design  method  for  balancing  operational  capabilities  against  life-cycle 


C.  Develop  understanding  ol  system  engineering  relationships  between  reliability,  fault  tolerance, 
repairability.  and  logistics  support. 


b.  Define  a  firm  basis  of  demonstration  test  plans  and  criteria  for  acceptance  testing. 
Development  of  validation  and  verification  plans  begins  with  the  clear  definition  of  require¬ 
ments  in  terms  of  quantifiable  parameters. 

c.  Provide  a  basis  for  establishing  and  applying  appropriate  formal  management  con¬ 
trols  that  can  be  used  at  designated  critical  milestones.  Management  of  systems  acquisition 
with  a  high  technology  requirement  is  important  and  dependent  on  the  quantification  of 
parameters. 

d.  Provide  a  quantifiable,  realistic  baseline  tor  life-cycle  cost  analysis  and  logistic  sup¬ 
port  planning. 

All  the  above  objectives  depend  on  clear  definition  of  requirements  in  quantifiable 
parameters.  Therefore,  the  following  projects  are  required  in  the  area  of  requirements 
definition. 

•  IAULTTOLI  KANCl:  REQUIREMENTS  Dll  INI  HON  AND  INI  E  RPRETATION 

The  purpose  of  this  task  is  the  development  of  a  more  meaningful  method  ot  describ¬ 
ing  maintainability  and  reliability  which  will  result  in  the  effective  application  ol  appropriate 
fault  tolerance  technology.  Current  technology  utilizes  terms  such  as  MTBT.  MTTR.  and 
NORM,  but  in  fact  does  not  consider  degrees  of  tolerance,  on-line  versus  off-line,  man- 
machine  tradeoffs,  etc.  This  project  will  explore  the  best  means  of  describing  reliability  and 
maintainability  requirements  for  greatest  effectiveness. 

•  MISSION  AVAILABILITY  ANALYSIS  METHODS 

This  project  addresses  mission  analysis  tools  and  techniques,  emphasizing  their  rela¬ 
tionship  to  fault  tolerance.  The  project  proposes,  through  several  interrelated  tasks,  to  evalu¬ 
ate  existing  automated  aids  for  fault  tolerance  requirements  analysis,  and  then  to  utilize  these 
tools  in  demonstrating  tradeoff  analysis  in  selection,  degree  of  fault  tolerance,  on-line  versus 
off-line  application,  etc. 

•  PROJECT  DAIS  I  Al  l  I  lOU  RANCI  I  VALUATION 

The  AT  DAIS  program  established  and  designed  to  meet  specific  reliability  goals  The 
project  will  conduct  an  analysis  of  the  DAIS  system  to  determine  its  dice  liveness  in  meeting 
its  readability  goals.  I  his  project  will  determine  the  adequacy  ol  the  spec  ideations,  require 
mails,  and  design  in  the  system  acquisition  process. 

2..'  Kecjuircmcnl/risk  analysis  The  analysis  ot  requirements  to  determine  risks  associated 
with  costs,  schedule,  objectives,  and  technical  parameters  is  lequircd  lor  a  successful  acquisi¬ 
tion  program.  It  should  identity  technical  perlorm.mce  measurement  parameters,  test  require¬ 
ments.  and  critical  milestones.  The  analysis  will  rapine  a  quantifiable  tound.ilion  of  the 
technical  and  physical  characteristics  fora  fault  tolerance  technology  (also  reliability,  main¬ 
tainability.  and  supportability ).  In  order  to  perform  the  analysis,  certain  necessary  automated 
analysis design  tools  will  be  required.  Those  that  are  not  available  must  be  developed  This 
phase  of  the  acquisition  process  is  the  foundation  for  technical  program  planning  and  control. 
Definition  ol  effective  and  total  systems  programs  requires  that  present  analysis  capabilities 
be  strengthened.  The  enhanced  capabilities  are  critical  tor  tr.uleott  ot  obiectives  in  selection 
of  the  concept  formulation.  The  following  broad  RAD  projects  nie  required  bn  development 
of  a  (plan  tillable  foundation  of  design  analysis  Tor  phy  sk  al  and  tec  him  al  c  h.iractcrist  ns  oi 
fault  tolerance  technology. 


I  (. 


•  RELATION  OF  REQUIREMENTS  TO  SPECIFICATION 

This  project,  through  a  series  of  interrelated  subtasks,  will  define,  relate,  and  formulate 
specifications  necessary  for  achieving  a  given  set  of  operational  requirements  and  provide  de¬ 
sign  tools  for  use  in  designing  fault-tolerant  systems.  It  will  relate  the  defined  requirements 
to  an  identified  set  of  specifications  that  are  critical  in  meeting  requirements.  This  project 
will  review  the  process  of  developing  requirements,  performing  risk  and  cost  analysis,  concept 
formulation,  and  design  specification. 

•  RELIABILITY/FAULT-TOLERANT  ANALYSIS  AND  DESIGN  TOOLS 

The  goal  of  this  project  is  to  reduce  the  complexity  of  the  mathematical  model  and 
the  associated  programs  while  retaining  the  ability  to  represent  and  investigate  a  wide  variety 
of  redundant  systems  configurations.  Within  the  area  of  investigation  is  ARIES  (Automated 
Reliability  Interactive  Estimation  System),  which  is  an  interactive  package  of  programs  for 
computer-aided  reliability  analysis  of  fault-tolerant  redundant  systems.  This  project  should 
present  a  unified  viewpoint  of  reliability  modeling  based  on  the  theory  of  finite-state, 
continuous-time  Markov  processes.  The  project  should  develop  a  structured,  integrated 
approach  that  will  be  broadly  accepted  by  the  practitioners  in  the  field. 

•  FAULT  TOLERANCE  LIFE-CYCLE  COST  IMPACTS  -  MAINTENANCE  MODEL 

The  relation  of  fault  tolerance  to  both  acquisition  and  life-cycle  costs  has  not  been 
clearly  understood  and  this  task  is  intended  to  develop  the  relational  structure.  This  task  in¬ 
volves  the  analysis  and  development  of  maintenance  and  support  models  based  on  the  rela¬ 
tional  structure. 

•  REDUNDANCY  LIMITS/TRADEOFFS 

The  objective  of  this  project  is  to  establish  how  much  redundancy  is  required  to  meet 
the  fault  tolerance  objectives  of  system  design.  Graphical  and  analytical  approaches  will  be 
used  to  establish  the  amount  of  redundancy  needed  for  a  particular  design. 

2.4  System  concept  formulation.  This  phase  of  the  system  acquisition  cycle  begins  with 
stated  system  operational  requirements,  mission  objectives,  environmental  factors,  technical 
performance,  and  system  figures-of-merit  as  stipulated  and  proposed,  and  examines  the 
validity,  consistency,  desirability,  and  attainability  of  system  approaches  with  respect  to  cur¬ 
rent  technology,  physical  resources,  human  performance  capabilities,  life-cycle  costs,  and 
other  constraints.  System  functions  and  subfunctions  are  identified  and  analyzed  as  a  basis 
for  identifying  alternatives  for  meeting  performance  and  other  requirements.  Each  function 
and  subfunction  is  allocated  to  each  requirement  by  conducting  selected  analysis,  synthesis, 
and  design  activities.  This  phase  of  the  design  and  acquisition  cycle  should  produce  a  set  of 
system  design  specifications  for  contract.  System  concept  formulation  activities  logically 
partition  into  the  following  areas  for  more  detailed  discussion: 

a.  Functional  definition.  Functional  definition  for  fault  tolerance  will  determine 
modes  of  operation,  reliability  required  for  system  functions  and  subfunctions,  fault  tolerance 
functional  partitioning,  operational  usage  and  support,  and  other  system  functions.  Perform¬ 
ance  requirements  are  established  for  each  function,  and  subfunctions  are  identified.  Timing 
allocation  is  analyzed  and  defined.  Time  requirements  that  affect  mission  success,  safety,  and 
availability  are  derived.  Sufficient  detail  should  be  available  to  allow  specification  of  hard¬ 
ware,  software,  procedural  data,  and  personnel.  All  functions  and  their  distribution  must  be 
traceable  through  analysis  to  the  system  operational  requirement  they  are  designed  to  fulfill. 

b.  System  approaches.  Given  a  functional  description,  various  systems  approaches 
can  be  explored  relative  to  meeting  system  requirements,  costs,  and  other  objective  tradeoffs. 


Sufficient  preliminary  design  and  analysis  to  confirm  and  assure  the  completeness  of  perform¬ 
ance  are  necessary.  In  addition,  the  configuration  and  its  arrangement  of  the  elements  of  sys¬ 
tem  are  explored.  Techniques  for  detection,  isolation,  recovery  and  repair,  and  operation  arc- 
portrayed  in  a  suitable  form  to  illustrate  system  and  item  interfaces,  traceability  between  ele¬ 
ments.  and  documentation  of  basic  source  data  for  project  management  and  control.  This 
phase  is  also  concerned  with  logistics  engineering,  life-cycle  costs,  and  optimization  of  alter¬ 
native  approaches. 

The  following  tasks  are  needed  to  develop  the  system  concept  formulation  of  fault 
tolerance. 

•  THEORY  OI:  TESTING  TAXONOMY 

The  goal  of  this  project  is  to  develop  a  new  understanding  and  application  of  testing 
techniques  to  lault-tolerant  systems.  The  theory  of  testing  has  always  been  an  area  of  confu¬ 
sion  with  no  common  definition.  There  is  uncertainty  as  to  how  much  testing  is  enough  and 
what  is  the  relative  app!' .ability  of  a  given  testing  technique.  With  complex  systems  in  the 
field,  experience  is  indicating  that  life-cycle  costs  are  dominating  procurement  costs,  and  test¬ 
ing  and  maintenance  are  the  major  sources  of  life-cycle  costs  as  well  as  a  limiting  factor  of 
performance  due  to  system  availability. 

•  RELIABILITY  MEASURE  OF  TESTABILITY  CONCEPTS 

The  goal  of  this  project  is  to  formulate  a  fundamental  unit  to  measure  the  testability 
of  circuits,  subsystems,  and  systems. 

A  fundamental  quantifiable  parameter  that  is  a  measure  of  testability  of  the  system 
would  allow  a  hierarchical  structure  to  be  developed.  Such  a  structure  could  identify  areas 
requiring  more  extensive  testing,  the  points  on  partitions  at  which  verification  could  be  diffi¬ 
cult.  It  is  critical  to  the  design  and  acquisition  process  that  a  quantification  of  testability  be 
developed. 

•  I  AIH'RI  I  AL'I  I  PREDICTION  TECHNOLOGY 

This  project  addresses  the  development  of  the  techniques,  concepts,  and  methodolo¬ 
gies  lor  fault/failure  predictions.  It  will  encompass  both  analog  and  digital  fault  prediction 
and  the  system  approaches  necessary  to  exploit  the  capability  of  prediction. 

The  advance  ol  microelectronic  technology  will  permit  the  monitoring  and  analysis  of 
circuit  and  system  parameters  for  fault  and  failure  prediction.  This  capability  will  allow  tre¬ 
mendous  savings  in  maintenance  and  retrofit  costs  in  addition  to  significantly  improving  the 
fault  tolerance  performance  of  systems. 

-  RECOVERY  TECHNIQUES 

The  purpose  of  this  project  is  to  develop  system  design  approaches  for  effective  appli¬ 
cation  ol  recovery  techniques  to  fault-tolerant  systems.  Recovery  techniques  in  fault-tolerant 
systems  tire  implemented  to  compensate  for  detected  faults,  and  these  techniques  are  germane 
to  successful  system  design.  In  their  simplest  form  they  consist  of  voter  circuits,  and  in  the 
more  complex  systems  they  are  extensive  and  complicated  software  programs. 

'  TRANSIENT  EAULTS 

Transient  faults  that  occur  in  electronic  systems  have  been  difficult  to  cope  with  be¬ 
cause  of  their  elusive  nature  I  Ills  protect  will  address  I  he  nature  of  transient  faults  and  the 
test  techniques  that  are  applicable  for  detection. 

transient  faults  are  serious  because  they  introduce  errors  at  the  least  desirable  time, 
which  can  be  catastrophic  In  the  past  such  faults  were  accepted  as  undetectable  and  impos¬ 
sible  to  isolate.  However,  recent  advances  in  tcchnologc  indicate  that  such  laults  can  be  de- 
tee  leal  and  corrected. 


•  IDENTIFICATION  OF  FAILURE  MODES 


The  goal  of  this  project  is  to  identify  and  classify  failure  modes  that  are  likely  to 
occur  in  the  various  classes  of  systems.  A  fault-tolerant  design  must  be  based  on  statistical 
failure  rates.  For  this  reason  failure  rate  data  must  be  collected  and  categorized. 

•  FAULT -TOLERANT  SOFTWARE 

It  is  the  goal  of  this  project  to  develop  a  theoretical  foundation  for  detecting,  isolat¬ 
ing,  and  recovering  from  faults  with  reliable  software. 

Reliable  software  implies  robustness  of  fault  tolerance.  That  is,  it  is  expected  to 
deliver  a  certain  minimum  level  of  services  even  when  faced  with  an  unexpected  or  hostile  en¬ 
vironment  (such  as  hardware  failure  or  bad  data).  Techniques  for  detection  and  isolation  of 
faults  are  classified  into  testing  (analysis  of  responses  of  a  selected  set  of  input  data)  and  veri¬ 
fication  (mathematical  proof  and  constructive  approach). 

HIGH-ORDER  LANGUAGE  CONSTRUCTS  FOR  FAULT-TOLERANT  SYSTEMS 

The  purpose  of  this  project  is  the  analysis  and  development  of  fault-tolerant  language 
constructs  suited  for  inclusion  in  the  DoD  high-order  language  standard. 

This  project  will  address  language  constructs  that  allow  incorporation  of  program  re¬ 
dundancy  into  a  block-structured  program  in  a  well  structured  manner.  This  project  does  not 
address  system  architecture  and  design  methodologies  for  fault-tolerant  programs. 

The  most  effective  method  of  meeting  complex  availability  requirements  and  reducing 
development  and  support  costs  is  to  develop  methodologies  and  tools  for  systematic  incorpo¬ 
ration  of  redundancy  programming  in  a  well  structured  form.  Good  structuring  of  a  fault- 
tolerant  program  is  particularly  important,  since  the  use  of  program  redundancy  increases 
program  size.  Fault-tolerant  programs  have  generally  higher  development  costs  and  higher 
processing  cost.  The  high  processing  cost  of  fault-tolerant  programs  is  due  to  increased  proc¬ 
essor  time  and  increased  storage  space  required  for  executing  redundant  program  components. 

•  ANALOG  FUNCTIONAL  REDUNDANCY 

This  project  will  address  several  approaches  that  will  result  in  the  use  of  functional 
redundancy  in  analog  circuits  and  will  define  structuring  of  analog  fault  tolerance  test  tech¬ 
niques  and  capabilities,  and  also  those  characteristics  of  analog  circuits  that  can  be  exploited 
in  systems.  The  analog  circuits  to  be  considered  will  span  the  entire  known  frequency  range, 
will  be  of  both  high  and  low  power,  and  will  represent  a  complexity  of  functions  rivaling  the 
ability  of  digital  circuits. 

Analog  circuits  provide  a  unique  challenge  to  fault-tolerant  design  since  they  are  par¬ 
ticularly  complex  and  are  far  more  prone  to  failure  than  digital  circuits.  Analog  circuits  often 
require  tight  component  tolerances  to  function  properly  and  tend  to  slowly  degrade.  This 
slow  degradation  presents  a  special  challenge  to  the  designer  of  redundant  circuits. 

•  FUNCTIONAL  TEST  DESIGN  THEORY 

This  project  will  develop  the  theory  of  functional  testing,  from  its  theoretical  founda¬ 
tions  to  principles  of  application  in  systems,  software,  and  hardw  are.  This  program  can  be¬ 
come  the  basis  of  functional  testing  microcomputers  as  well  as  analog  functional  testing. 

Functional  testing  has  often  been  omitted  in  system  design  until  the  last  stages.  This 
leads  to  testing  as  an  add-on.  rather  than  a  built-in.  function.  If  a  good  applied  theory  of  test¬ 
ing  is  developed,  tunctional  testing  can  become  part  of  design  from  the  beginning. 

•  FAULT  TOLERANCE  VALIDATION  AND  VERIFICATION 


Without  verification  procedures  it  will  be  difficult,  if  not  impossible,  for  the  govern¬ 
ment  to  put  a  fault  tolerance  requirement  into  a  system  specification.  There  must  be  a  clearly 
defined  procedure  for  the  government  to  use  in  purchasing  a  system  from  a  contractor;  other¬ 
wise  the  fault-tolerant  specifications  can  only  be  design  goals  and  not  design  requirements. 

•  FSTIMATION  OF  CONFIDENCE  LIMITS  TESTING  FOR  TLS1  INC,  LARGF 
LOGIC  NETWORKS 

The  purpose  of  this  project  is  to  provide  a  methodology  and  means  for  accurately 
determining  confidence  limits  for  the  reliability  of  large  digital  logic  networks,  without  ex¬ 
haustively  exercising  all  possible  input  sequences  or  simulating  all  logical  faults  in  the  network 
The  project  addresses  development,  implementation,  and  demonstration  of  a  logic  simulation 
and  fault  analysis  system  to  serve  as  a  tool  for  determination  of  reliability  confidence  limits 
in  digital  networks  containing  up  to  15  000  elements. 

Faults  will  be  modeled  via  probabilistic  techniques,  and  will  occur  singly  and  in  multi¬ 
ple.  The  system,  when  completed,  should  be  capable  of  reasonably  approximating  the  actual 
operation  of  an  existing  logic  network,  including  occurrence  and  location  of  failures,  when  all 
necessary  processing  and  operating  parameters  are  input  to  the  model.  As  long  as  the  built-in 
redundancy  has  not  been  exhausted,  no  symptom  will  appear  outside  the  module.  However, 
when  the  redundancy  has  been  exhausted  or  overwhelmed  by  a  fault,  module  failure  will  re¬ 
sult.  Separate  detection  and  recovery  functions  are  not  identified  from  outside  the  module. 

•  COMMAND  control  fault  tolfrangl 

The  goal  of  this  project  is  to  develop  a  design  and  evaluation  methodology  for  fault- 
tolerant  command  control  global  networks. 

The  availability  of  the  military  (  ^  systems  on  a  global  basis  is  of  major  importance, 
especially  during  the  periods  preceding  and  during  a  conflict.  A  fault-tolerant  design  could 
contribute  to  the  availability  of  a  system  during  critical  periods. 

Because  of  the  many  sensors  now  available,  the  sophisticated  data  processing,  and  the 
man-machine  interfaces  required,  it  is  critical  that  fast  communications  be  available  within  a 
platform,  among  platforms,  and  on  a  global  basis  to  give  a  commander  critical  information 
when  he  needs  it.  This  is  true  during  the  entire  conflict  scale,  from  a  prehostilities  crisis  to 
global  war,  because  data  acquired  from  local  sensors  as  well  as  from  those  operated  by  shore- 
based  commanders  in  support  of  task  forces  will  be  correlated  and  evaluated  in  near  real  time 
as  inputs  to  a  commander's  decision-making  process. 

■  ARCI1H  I  (  TURI  S  FOR  avail  ability  rfquiri  mi  nts 

The  purpose  of  this  project  is  the  review,  analysis,  and  development  ol  sy stem  archi¬ 
tecture  and  networks  that  will  result  in  increased  lault-tolerant  systems  capabilities.  I  hrougli 
this  project,  system  architecture  tradeoffs  will  be  established  and  applied  to  determine  piacti- 
eal  system  structures  that  meet  all  requirements. 

The  importance  of  this  project  is  reflected  in  the  tact  that  sy  stem  architectures  tor 
meeting  performance  designs  are  in  conflict  with  those  of  lault-tolerant  design  (eg.  perform¬ 
ance  design  dictates  single  centralized  processing  while  lault-tolerant  design  requires  distrib¬ 
uted  processing  and  multichannel  communications). 

•  COMMUNICATION  PROTCX  OL  FOR  FA U I  T  TC ) I  I  R A N(  I 

Bus  protocol  lor  reconfigurability  of  bus  structured  system  architecture  is  necessary 
for  simple  and  cost-effective  system  design.  This  project  addresses  the  requirements  and  spo. 
fications  that  lead  to  standardization  of  Inis  protocol.  It  will  levicw.  analy  ze.  and  catalog 
existing  and  proposed  bus  protocols  and  interface  techniques  to  determine  fault  tolerant 
systems  applicability 


2.5  Specifications.  After  the  system  concept  has  been  formulated,  system  specifications 
need  to  be  written.  The  contractual  requirements  of  a  potential  supplier  require  careful  defi¬ 
nition  to  ensure  that  acceptance  by  the  government  will  be  accomplished  smoothly. 

The  specifications  should  include  the  system  design  requirements  and  goals,  the 
acceptance  procedures,  the  documentation  requirements,  the  major  milestone  requirements, 
and  any  other  data  that  may  be  required  by  a  particular  procurement. 

The  following  projects  are  required  for  fault  tolerance  specifications  development. 

•  REQUIREMENTS  FOR  REDUNDANCY  MANAGEMENT 

The  goal  of  this  project  is  to  develop  a  common  approach  to  fault-tolerant  system  de¬ 
sign  through  redundancy  management.  First  there  will  be  an  analysis  of  fault-tolerant  redun¬ 
dancy  design  techniques  and  management  techniques  will  be  developed. 

•  STANDARDS  AND  FAULT  TESTING 

The  goal  of  this  project  is  to  recommend  changes  to  existing  military  standards  that 
include  fault-tolerant  designs.  The  use  of  standards  is  necessary  for  simplified  and  cost- 
effective  application  of  fault  tolerance  in  military  systems.  This  task  will  be  instrumental  in 
obtaining  a  fault-tolerant  capability  in  all  military  systems. 

2.6  Design  and  development.  Design  is  the  process  by  which  the  conceptual  approach  is 
optimized  through  tradeoffs  and  conformance  to  specifications.  It  involves  analytical  verifi¬ 
cation  and  specific  system  synthesis  and  implementation.  This  phase  of  the  acquisition  cycle 
involves  many  different  areas: 

a.  Translate  specifications  to  hardware/software.  Evolve  a  detailed  design  conforming 
to  specified  fault  tolerance  requirements,  system  fault-tolerant  concept,  and  demonstration 
requirements. 

b.  Conduct  fault-tolerant  analysis  and  design  evaluation,  to  guide  mechanization 

process. 

c.  Assess  fault-tolerant  capability  in  final  design  configuration  and  compatibility  with 
a  specified  fault  tolerance  concept. 

d.  Incorporate  all  specifications  and  tasks  into  an  integrated  systems  product  for  final 
development. 

In  support  of  these  design  areas  the  following  broad,  all-encompassing  R&D  projects 
should  be  accomplished. 

•  SELF-DIAGNOSING  DESIGN  TECHNIQUES 

The  goal  of  this  project  is  to  develop  a  definition  of  general  specifications  for  design 
and  utilization  of  LSI  components  and  modules  in  self-diagnosing  systems. 

This  project  will  study  and  analyze  architecture,  functional  partitioning,  and  module 
and  component  design  features  necessary  for  achieving  self-diagnostic  microprogrammable 
capabilities  in  processors.  The  project  will  result  in  a  set  of  guidelines  for  designing  LSI  com¬ 
ponents  and  modules  that  ensure  testability  of  the  component  and  the  system  employing  the 
component.  Redundancy  levels,  organization  of  functions,  requirements  for  test  points  to  be 
made  accessible,  and  the  procedures  for  using  these  data  to  design  and  implement  a  processor 
that  is  self-diagnosable  down  to  the  replaceable  module  level  will  be  postulated  and  verified. 

•  REDUNDANT  MICROCOMPUTERS 

The  purpose  of  this  project  is  review,  analysis,  and  development  of  redundant  micro¬ 
computer  concepts  for  tault-tolerant  systems  design 


The  advent  of  the  one-chip  microcomputer  has  opened  the  possibility  of  new  approach 
es  to  redundant  computer  configurations  that  exploit  hardware  executives,  simplified  voting 
circuits,  and  reduced  switch  complexities.  The  use  of  redundant  microcomputers  is  now 
considered  cost-effective  and  feasible  for  broad  classes  of  problems.  This  project  will  formal¬ 
ize  the  approaches  to  applying  redundant  computers  through  analysis,  design,  and  verification. 


•  LOOSFLY  COUPLFD  FAULT-TOLERANT  COMPUTER  NETWORKS 

The  objective  of  this  project  is  to  develop  a  theory  that  will  address  the  problems  of 
loosely  coupled  computer  networks.  Loosely  coupled  computers  are  defined  as  those  between 
any  two  of  which  there  is  only  one-way  communication.  Often  a  system  (for  example,  the 
command  control  system)  has  long  distances  between  its  various  subsystems  and,  due  to  these 
long  distances,  the  communication  paths  have  large  delays.  If  a  message  is  not  received  cor¬ 
rectly,  it  is  impractical  to  immediately  notify  the  sender  to  repeat  the  message.  This  con¬ 
straint  on  the  system  presents  new  problems  that  may  require  changes  in  design  in  order  to 
make  the  system  fault-tolerant. 

•  FAULT  TOLL RANCH  MASKING  HAZARDS 

The  goal  of  this  project  is  to  investigate  contemporary  problems  associated  with  re¬ 
dundancy.  in  particular  those  associated  with  masking  (or  static  redundancy ).  and  to  develop 
a  set  of  design  guidelines  that  will  help  designers  avoid  problems  in  designing  with  static 
redundancy. 

•  t  ault  tolfranct:  DHSIC.N  HANDBOOK 

The  summation  of  technological  data  on  fault  tolerance  technology  should  be  con¬ 
tained  in  a  fault  tolerance  design  handbook,  and  this  task  will  develop  the  handbook  to 
assume  the  widespread  dissemination  of  data. 

•  DFSKiN/DF VFLOPMFNT  TOOLS  (METHODOLOGY) 

Design  and  development  of  complex  fault-tolerant  systems  involve  many  procedures, 
techniques,  and  methods  that  can  be  automated.  In  particular,  tools  can  be  automated  for 
generation  of  testv  documentation,  circuit  simulation,  and  other  functions.  This  project  will 
address  the  development  of  the  design  and  development  methodology  and  the  tools  required 
to  support  it. 

2.7  lest  and  evaluation.  System  acquisition  generally  requires  that  an  Initial  Operational 
Test  and  Fvaluation  (IOT&F)  be  conducted  in  order  to  validate  the  reliability  and  maintaina¬ 
bility  characteristics  ol  a  system.  IOT&F  design  requires  that  special  consideration  must  be 
given  to  lault-tolerant  systems,  since  internal  failures  may  be  masked  In  the  designed-in  lault- 
tolerant  characteristics  of  the  system.  There  is  a  requirement  to  account  for.  measure,  and 
document  these  internal  failures.  An  internal  failure  will  degrade  a  system  and  make  it  less 
fault-tolerant  :  however,  it  is  generally  unobservable  from  outside  the  system  Therefore. 
IOT&I  specifications  will  require  special  provisions  for  testing  and  evaluating  tan  1  t-tole t a n t 
systems.  The  following  projects  address  these  problems. 

'  A(  ( T  PTANTI  TFSTINU 

The  goal  of  tins  project  is  to  develop  acceptance  testing  procedures  I o i  lault-tolerant 
sv stems,  l  ault-tolerant  systems  continue  to  operate  normally  even  though  there  has  been  a 
failure.  However,  the  failure  will  have  degraded  the  system.  Therefore,  special  acceptance 
testing  procedures  will  be  required  for  acceptance  testing  of  lault-tolerant  s\stcniv 
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•  TECHEVAL-OPEVAL  TECHNIQUES 

The  technical  and  operational  evaluation  of  fault-tolerance  capabilities  is  critical  to 
the  development  of  new  systems.  Yet.  the  test  and  evaluation  is  difficult  to  accomplish  after 
system  design.  The  purpose  of  this  project  is  (a)  to  access  the  effectiveness  of  T&E  after¬ 
design,  and  (b)  to  indicate  techniques  to  perform  more  effective  fault  tolerance  systems  test 
and  evaluation. 

3.0  Project  Priorities 

3.1  Criteria  for  project  selection.  The  basic  objectives  of  the  program  plan  are  ( 1 )  to 
develop  the  basic  foundation  of  fault  tolerance  technology,  (2)  to  develop  technology  par¬ 
ticular  to  the  major  application/mission  areas,  and  (3)  to  develop  a  technology  transfer.  These 
represent  the  basis  for  the  selection  of  project  priorities.  Each  project  is  critical  to,  and  re¬ 
late';  with,  other  projects  that,  in  total,  meet  the  program  objectives.  It  is  necessary  to  recog- 
niz<.  that  not  all  projects  can  be  accomplished  concurrently.  Additionally,  if  they  are  per¬ 
formed  out  of  sequence,  they  may  not  be  effective  in  developing  a  basic  fault-tolerant  tech¬ 
nology,  even  if  the  research  has  been  performed  well.  If  the  time  priority  is  not  followed, 
much  of  the  research  will  need  to  be  repeated  at  a  later  date. 

The  priority  list  is  also  important  in  the  selection  of  tasks  during  periods  when  funding 
is  austere.  Progress  can  be  made  toward  the  final  objective  despite  funding/fiscal  variations. 

3.2  Priority  list.  Table  2  is  a  list  of  projects,  indicating  recommended  priority  for 
accomplishment. 
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TABLE  2.  PROJECT  PRIORITY. 


Project 


I.  OPERATIONAL  REQUIREMENTS 

1 .  Fault  Tolerance  Requirements  Definition  and  Interpretation 

2.  Mission  Availability  Analysis  Methods 

3.  Project  DAIS  Fault  Tolerance  Evaluation 

II.  REQUIREMENTS/RISK  ANALYSIS 

1 .  Relation  of  Requirements  to  Specification 

2.  Reliability/Fault  Tolerance  Analysis  and  Design  Tools 

3.  Alternate  System  Design  Evaluation 

4.  Fault  Tolerance  Life-Cycle  Cost  Impacts  Maintenance  Model 

5.  Establishment  of  Redundancy  Limits/Tradeoffs 

III.  SYSTEM  CONCEPT  FORMULATION 

1 .  Theory  of  Testing  Taxonomy 

2.  Reliability-Measure  of  Testability  Concept 

3.  Failure-Fault  Prediction  Technology 

4.  Recovery  Techniques 

5.  Transient  Faults 

b.  Identification  of  Failure  Modes 

7.  Fault-Tolerant  Software 

8.  High-Order  Language  Constructs  For  Fault-Tolerant  Systems 

4.  Analog  Functional  Redundancy 

10.  Functional  Test  Design  Theory 

1 1 .  Fault  Tolerance  Validation  and  Verification 

1 2.  Estimation  of  Confidence  Limits  Testing  Large  Logic  Networks 

13.  Command  Control  Fault  Tolerance 

14.  Architectures  For  Availability  Requirements 

15.  Communication  Protocol  For  Fault  Tolerance 

IV  SPECIFICATIONS 

1.  Specification  For  Redundancy  Management 

2.  Standards  and  Fault  Tolerance 

V.  DESIGN 

I.  Self-Diagnosing  Design  Techniques 
2  Redundant  Microcomputers 

3.  Loosely  Coupled  Fault-Tolerant  Computer  Networks 

4.  Fault  Tolerance  Masking  Ha/ards 

5.  Fault  Tolerance  Design  Handbook 

h.  Design/ Development  1  ools  ( Methodology ) 

VI.  TEST  AND  EVALUATION 

1  Acceptance  Testing 

2  TFT  HI  VAl  -OPI  VAI  Techniques 

Priority:  I  Immediate  need 

2  Important 

3  less  critical 
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FAULT  TOLERANCE  REQUIREMENTS  DEFINITION  AND  INTERPRETATION 


1 .0  Introduction  (objective).  The  purpose  of  this  task  is  the  development  of  a  more 
meaningful  method  of  describing  maintainability  and  reliability  which  will  result  in  the 
effective  application  of  appropriate  fault  tolerance  technology.  Current  terminology  utilizes 
terms  such  as  MTBF,  MTTR,  and  NORM,  but  in  fact  does  not  consider  degrees  of  tolerance, 
on-line  versus  off-line,  man-machine  tradeoffs,  etc.  This  project  will  explore  the  best  means 
of  describing  reliability  and  maintainability  requirements  for  greatest  effectiveness. 


2.0  Approach.  This  task  will  compile  a  set  of  terminology  and  relate  the  terms  to  fault- 
tolerant  parameters. 


3.0  Scope.  This  task  is  limited  only  to  requirements  definitions.  It  must  be  integrated 
with  other  tasks  on  risk  analysis  and  fault  tolerance  specifications. 


4.0  References 


5.0  Background 


MISSION  AVAILABILITY  ANALYSIS  METHODS 


1.0  Introduction  (objective).  The  force  structure  analyses  upon  which  the  military  bases 
its  major  procurement  decisions  involve  force  effectiveness  calculations  that  generally  as¬ 
sume  command  control  functions  are  carried  out  to  perfection  with  perfectly  operating 
equipment.  However,  in  practice,  when  equipment  fails,  redundancy  may  or  may  not  exist, 
and  the  resulting  performance  is  far  from  ideal.  A  good  mission  exercising  capability  must 
be  provided  to  exploit  the  development  of  fault-tolerant  technology  in  command  control 
and  other  systems.  The  nature  of  redundancy,  its  requirements  and  relationship  to  mission 
availability,  and  its  dependence  on  new  technology  must  be  understood.  Tools  must  be 
provided  to  achieve  automated  assistance  in  availability  analysis.  Particularly  important  in 
these  analyses  is  the  impact  of  human  errors  that  often  may  affect  combat  readiness. 

2  0  Approach.  This  project  will  consist  of  the  interrelated  tasks  listed  below. 

2-1  Evaluation  of  automated  aids  for  fault  tolerance  requirements  analysis.  Various 
existing  programs  will  be  compared,  advantages  and  disadvantages  listed,  and  recommenda¬ 
tions  for  mission  availability  analyses  programs  presented. 

2.2  Experimental  mission  availability  analysis.  This  task  will  utilize  the  tools  developed 
in  paragraph  2. 1  above  and  demonstrate  tradeoff  analysis  in  the  selection,  degree  of  fault 
tolerance,  on-line  versus  off-line,  role  of  the  human,  etc. 

3.0  Scope.  This  project  will  address  only  the  mission  analysis  tools  and  techniques  and 
their  relationship  to  fault  tolerance  requirements. 

4.0  References 

•  Project  TRANSIM,  A  Ten-Year  Progress  Report,  RR  O’Neill  and  AM  Feiler, 
UCLA-ENG-7448  August  74.  contract  N00014-69-A-0200-4052 

•  Carrier  Aircraft  Support  Effectiveness  Evaluation  (CASEE),  A  Report, 
NAVAIRSYSCOM-A1R-52022,  2  June  1976 

•  ANALOG  ATG  Research  at  Grumman,  HH  Schrieber.  24  March  1977.  to  be 
presented  at  the  National  Aeronautical  and  Electronics  Conference 

•  A  Study  of  the  Recoverability  of  Computing  Systems,  Merlin,  PM  Univ  Micro¬ 
films.  Ann  Arbor,  MI,  no  75-1 1026,  Univ  of  California,  Irvine 

•  Recoverability  of  Modular  Systems,  Merlin,  PM.  Farber,  DJ.  Proceedings  of  the 
ACM  Sigcomm/Sigops  Interprocess  Communications  Workshop,  p  51-56.  24-25 
March  1975.  ACM,  NY 

5.0  Background.  Military  systems  such  as  a  Navy  task  force  or  aircraft  carrier  perform¬ 
ance  in  a  given  mission  will  experience  different  decision-making  situations  that  are  depend¬ 
ent  on  the  availability  of  different  equipment.  Current  methods  of  analysis  need  to  be 
supplemented  by  automated  methods  based  on  well  formulated  models. 


PROJECT  DAIS  FAULT  TOLERANCE  EVALUATION 


1 .0  Introduction  (objective).  The  AF  DAIS  program  was  established  and  designed  to 
meet  specific  reliability  goals.  The  project  will  conduct  an  analysis  of  the  DAIS  system  to 
determine  its  effectiveness  in  meeting  its  reliability  goals.  This  project  will  determine  the 
adequacy  of  the  specifications,  requirements,  and  design  in  the  system  acquisition  process. 

2.0  Approach.  This  task  will  involve  analyzing  test  data  and  comparing  these  data 
against  specification  and  requirements. 

3.0  Scope.  The  task  must  be  conducted  from  the  point  of  view  of  the  program  manager 
interested  in  meeting  his  project  goals. 

4.0  References 


5.0  Background 


RELATION  OF  REQUIREMENTS  TO  SPECIFICATION 


1.0  Introduction  (objective).  This  project  will  define,  relate,  and  formulate  the  specifi¬ 
cations  that  are  necessary  to  achieve  a  given  set  of  operational  requirements.  It  will  identify 
a  lexicon  for  stating  the  operational  requirements  for  availability.  It  will  relate  these  re¬ 
quirements  to  an  identified  set  of  specifications  that  are  critical  to  meeting  the  require¬ 
ments.  This  program  will  review  the  process  of  developing  requirements,  performing  risk 
and  cost  analysis,  concept  formulation,  and  design  specification. 

2.0  Approach.  The  following  tasks  are  to  be  performed : 


2.1  Operation  requirements  definition.  This  task  will  define  a  set  of  operational  require¬ 
ments.  The  lexicon  will  include  a  degree  of  fault  tolerance,  operational  procedures,  and 
other  specific  requirement  description. 

2.2  Specification  definition.  This  task  will  define  a  set  of  specifications  for  fault- 
tolerant  design  that  can  be  related  to  the  operational  requirements,  and  the  system  concept 
formulation  process. 

2.3  System  concept  formulation  definition.  This  task  will  address  the  definition  of 
three  areas:  operating  procedures,  system  approaches,  and  availability  and  recovery.  The 
importance  of  these  areas  to  unambiguous  specifications  for  fault  tolerance  is  recognized. 
Each  of  these  areas  must  be  addressed  separately  and  then  all  must  be  addressed  collectively. 

3.0  Scope.  This  project  should  relate  the  requirements-specification  relationships  to  the 
military  procurement  procedures.  This  will  include  all  aspects  from  initial  specification  to 
the  retirement  of  the  system. 

4.0  References 

•  A  Fault-Tolerant  Estimator  for  Redundant  Systems.  Broen.  RB,  IEEE  Transac¬ 
tion  Aerosp  and  Electron  Sypt,  vol  AES-1  1,  no  6,  p  1 28 1-1  285,  November  1 975, 
McDonnell  Aircraft  Co 

•  Approaches  to  Reliable  Computing,  Avizienis,  A,  Sigplan-not,  vol  10,  no  6,  p  458- 
464,  June  1975,  UCLA 

5.0  Background.  There  has  been  work  done  on  the  theory  of  fault  tolerance,  but  theo¬ 
retical  work  has  not  been,  except  possibly  in  a  few  cases,  translated  into  system  specifica¬ 
tions  so  that  a  contractor  can  use  them  to  design  a  system.  This  project  will  provide  a  step 
in  the  direction  of  translating  theory  into  practice. 
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RELIABILITY /FAULT  TOLERANCE  ANALYSIS  AND  DESIGN  TOOLS 


1 .0  Introduction  (objective).  The  goal  of  this  project  is  to  reduce  the  complexity  of  the 
mathematical  model  and  the  associated  programs  while  retaining  the  ability  to  represent 
and  investigate  a  wide  variety  of  redundant  configurations  of  systems. 

2.0  Approach.  A  number  of  different  analysis  tools  need  to  be  evaluated.  Within  the 
area  of  investigation  is  ARIES  (Automated  Reliability  Interactive  Estimation  System), 
which  is  an  interactive  package  of  programs  for  computer-aided  reliability  analysis  of  fault- 
tolerant  redundant  systems.  This  project  should  present  a  unified  viewpoint  of  reliability 
modeling  that  is  based  on  the  theory  of  finite-state,  continuous-time  Markov  processes.  The 
project  must  develop  a  structured,  integrated  approach  that  will  be  broadly  accepted  by  the 
practitioners  in  the  field. 

3.0  Scope.  This  project  must  be  considered  in  relation  to  other  design  and  analysis 
tools,  and  every  attempt  should  be  made  to  integrate  the  analysis  and  design  tools  into  a 
larger  set  of  programs  which  can  be  used  in  system  analysis  and  design. 

4.0  References 

•  AIRES  An  Automated  Reliability  Estimation  System  for  Redundant  Digital 
Structures.  Ying-Way  Ng  and  Algirdas  Avizienis,  Proceedings  1977  Animal  Relia- 
bility  and  Maintainability  Symposium,  p  108-1  13 

•  The  Influence  of  Software  Structure  on  Reliability.  Parnas,  DL.  Sigplan  Not.  vol 
10.  no  6,  p  358-362.  June  1975,  Tech  Univ,  Darmstadt, Germany 

•  Survey  of  Computer  Reliability  Studies,  McCluskey,  EJ.  Ogus.  RC.  Electro- 
Technol  (India)  vol  19,  no  4,  p  82-95,  December  1975,  Stanford  University 

•  On  Balancing  Hardware  Firmware  for  Designing  a  Fault-Tolerant  Computers 
Series,  Courtois,  B,  Savcier,  G.  Micro  8:  Workshop  on  Microprograms.  8th  Annu¬ 
al  Proc.  Chicago,  IL,  p  1-5,  September  21-23,  1975,  New  York,  NY  195 

•  Reliability  and  Coverage  Analysis  of  Nonrepayable  Fault-Tolerant  Memory  Sys¬ 
tems,  Cox,  GW,  Carroll.  BD.  Final  Technical  Report,  I  July  1976.  Auburn  Univer¬ 
sity.  AL,  contract  NAS8-26930 

5.0  Background.  Critical  to  the  specification  and  design  of  fault-tolerant  systems  is  the 
availability  of  analysis  tools  for  the  estimation  of  reliability  of  fault-tolerant  structures.  I  he 
diversity  of  possible  redundant  structures  complicates  the  problem  of  rcliabilitv  and  availa¬ 
bility  assessment.  Reliability  modeling  is  one  of  the  principal  tools  for  reliability  prediction 
of  redundant  structures.  Because  of  the  extensive  computing  involved  in  making  an  esti¬ 
mate  for  a  single  set  of  parameters,  mathematical  models  of  redundant  ( analog  and  digital  I 
systems  are  practically  useful  to  designers  only  when  they  are  automated  in  the  form  ol  a 
computer  program.  Given  such  capability,  the  designer  (as  well  as  the  program  manager l 
can  generate  numerous  reliability  predictions  and  thus  explore  the  sensitivity  ol  the  pro 
posed  design  with  respect  to  changes  in  various  designer-controlled  structural  paiameters. 

An  interactive  capability  allows  new  sets  of  parameters  to  be  entered  at  some  compmei 
terminal  and  can  result  in  an  extensive  exploration  ol  design  alternatives 


ALTERNATE  SYSTEM  DESIGN  EVALUATION 


1 .0  Introduction  (objective).  This  project  will  develop  the  rationale  and  procedures  lor 
evaluating  the  relative  cost-effectiveness  of  alternate  fault-tolerant  system  designs.  System 
operational  capabilities  will  be  assigned  relative  values  and  the  cost  of  providing  the  various 
combinations  of  the  capabilities  will  be  compared  to  their  value  to  the  users. 

2.0  Approach.  A  way  to  relate  value  to  the  user  and  life-cycle  cost  of  a  system  needs  to 
be  developed.  This  will  give  a  measure  of  relative  cost-effectiveness  between  alternate  de¬ 
signs.  It  is  possible  that  gaming  theory  and  linear  programming  may  be  necessary  to  arrive 
at  the  most  cost-effective  systems  with  a  given  funds  constraint.  Once  the  basic  evaluation 
framework  is  designed,  it  may  be  possible  to  write  a  computer  program  to  do  the  evaluation 
of  any  system. 

3.0  Scope.  This  project  will  combine  existing  data  and  methodologies  with  new  proce¬ 
dures  to  measure  the  cost-effectiveness  of  alternate  system  designs. 

4.0  References 

•  Fault-Tolerant  Computing:  An  Introduction,  A  Avizienis,  Computer  Science 
Department,  University  of  California,  Los  Angeles,  January  22,  1977 

•  A  Fault-Tolerant  Spacecraft,  Gilley.  GC,  Digest  of  the  1972  International  Sympo¬ 
sium  on  Fault-Tolerant  Computing,  p  105-109,  June  19-21,  1972 

•  Theory  of  Games  and  Statistical  Decisions,  D  Blackwell  and  MA  Girschick,  John 
Wiley  and  Sons,  1954 

5.0  Background.  The  government  needs  a  procedure  for  evaluating  the  cost- 
effectiveness  of  alternate  fault-tolerant  system  designs  that  includes  the  many  constraints 
that  surround  government  procurement.  If  this  procedure  is  developed,  a  clearer  direction 
of  effort  will  guide  both  the  government  and  contractors  in  development  of  better  fault- 
tolerant  designs. 
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FAULT  TOLERANCE  LIFE-CYCLE  COST  IMPACTS  -  MAINTENANCE  MODEL 


1.0  Introduction  (objective).  The  relation  of  fault  tolerance  to  both  acquisition  and 
life-cycle  costs  has  not  been  dearly  understood  and  this  task  is  intended  to  develop  the 
relational  structure.  This  task  involves  the  analysis  and  development  of  maintenance  and 
support  models  based  on  the  relational  structure. 

2.0  Approach.  Develop  a  set  of  relational  mathematical  models  for  performing  cost, 
availability,  and  reliability  studies  of  systems  using  fault-tolerant  techniques. 

3.0  Scope.  The  task  must  be  conducted  from  the  point  of  view  of  the  program  manager 
and  system  engineer  who  is  faced  with  the  decision  on  whether  to  incorporate  fault  toler¬ 
ance  capabilities. 

4.0  References 


5.0  Background 


ESTABLISHMENT  OF  REDUNDANCY  LIMITS/TRADEOFFS 


1 .0  Introduction  (objective).  Tiie  objective  of  this  project  is  to  establish  how  much 
redundancy  is  needed  to  meet  the  fault  tolerance  objectives  of  the  system  design. 

2.0  Approach.  The  approach  will  be  to  develop  a  graphical  and  analytical  method  to 
determine  the  amount  of  redundancy  needed  to  meet  the  system  design  requirements  for 
fault  tolerance.  This  will  include  optimization  techniques  which  will  show  the  tradeoffs 
between  alternate  designs. 

3.0  Scope.  This  project  will  encompass  all  aspects  of  fault-tolerant  system  design. 

4.0  References 

•  Theory  of  Fault-Tolerance  —  1976  Annual  Report,  vol  I  and  II,  Jack,  LA,  et  al, 
ONR  contract  N0014-75-C001 1,  December  7,  1976 

•  Fault  Tolerance  in  Galois  Trees  -  An  Algorithm  for  Detection  and  Location  of 
Stuck-At  Type  Errors  in  Trees  of  Galois  Linear  Modules,  Marver,  JM,  June  1975 

•  Reliability  Modeling  and  Analysis  of  General  Modular  Redundant  Systems, 
Mathur.  FP.  Desousa,  PT,  IEEE  Transaction  Reliab,  vol  R-24,  no  5,  p  296-299, 
December  1975.  University  of  Missouri 

5.0  Background.  There  is  a  need  to  establish  design  guidelines  for  the  amount  of  redun 
dancv  used  to  meet  a  given  fault  tolerance  requirement. 


THEORY  OF  TESTING  -  TAXONOMY 


1.0  Introduction  (objective).  The  goal  of  this  project  is  to  develop  a  new  understanding 
and  application  of  testing  techniques  to  fault-tolerant  systems.  The  design  process,  tools, 
and  techniques  should  be  identified.  In  addition  the  impact  of  life-cycle  costs  should  be 
addressed. 

2.0  Approaeh.  The  top-down  approach  requires  a  foundation  in  several  areas. 

2.1  Measure  of  testability.  The  measure  (or  measures)  should  provide  a  quantitative 
assessment  of  the  desired  testability.  It  requires  both  basic  and  applied  research  in  analog, 
digital,  and  system  testability.  It  would  be  directed  at  determining  how  much  testability  is 
possible  and  what  the  incremental  costs  are  for  a  given  technique.  A  framework  would  be 
developed  to  provide  a  standard  way  of  specifying  the  degree  of  testability,  isolation,  main¬ 
tainability,  and  reliability  of  a  system. 

2.2  Classification  of  techniques.  Although  there  are  presently  a  large  number  of  test- 
ware  techniques  available,  little  or  no  classification  of  these  techniques  exists  for  fault  toler¬ 
ance  or  maintainability  design.  A  comprehensive  taxonomy  would  substantially  improve 
application  of  current  knowledge  and  refine  existing  techniques. 

New  research  in  techniques  will  be  needed  to  keep  abreast  of  advances  in  compo¬ 
nent,  device,  and  systems  technology.  A  structure  taxonomy  to  meet  changing  demands  is 
critical. 

2.3  System  engineering  approaches.  Designing  testable  systems  requires  a  quantitative 
theory  that  allows  the  system  engineering  approach  to  tie  fragmented  design  issues  and  tech¬ 
niques  together.  The  theory  of  testing  should  provide  an  overall  structure  for  assuring  that 
correctness  and  verification  can  be  obtained.  Integration  of  testware  must  be  accomplished 
by  application  of  the  theory  during  the  system  development  cycle. 

2.4  Design  tools.  The  tools  must  provide  the  mechanism  that  will  allow  theory  to  be 
applied.  The  tools  will  support  the  program  manager  and  designer  during  the  design  and 
acquisition  cycle  of  the  system.  This  would  provide  automated  documentation,  simplified 
design,  and  time/cost-effective  application  of  testware  techniques. 

3.0  Scope.  This  project  is  the  structure  that  provides  the  framework  for  a  large  number 
of  testing  projects.  Its  intent  is  to  clearly  establish  a  baseline  for  testing. 

4.0  References 

•  Theory  ot  Fault-Tolerance,  vol  I  and  II.  Larry  Jack.  Honeywell  Report,  contract 
N001 4-75-C-00 1  I.  December  1  ‘>7 (. 

•  Fault  Detection  in  Redundant  Circuits,  Friedman.  AD,  D  LL  Transactions  on 
Electronic  Computers,  vol  EC- 1 6(  1 ) 

•  Current  Research.  McCluskey.  EJ.  Wakerly.  John  F.  Ogus.  RC.  Technical  Report 
no  100,  October  1  ‘>75.  (  enter  for  Reliable  Computing.  Stanford  University 


5.0  Background.  The  theory  ot  testing  lias  always  been  an  area  of  contusion  with  no 
common  definition,  uncertainty  as  to  how  much  testing  is  enough,  and  many  questions 
relative  to  testing  technique  applicability.  The  military  needs  to  formulate  the  theory  of 
testing  and  support  this  theory  with  a  formal  structure.  The  theory  should  result  in  new 
understanding  and  application  of  testing  techniques.  Among  the  areas  that  can  benefit  from 
a  vigorous  theory  of  testing  are  specifications,  design  process,  tools,  and  techniques. 

Until  recently,  system  testing  and  support,  much  like  software,  have  been  consid¬ 
ered  to  be  of  secondary  importance  in  comparison  to  the  major  costs  of  hardware  procure¬ 
ment.  With  complex  systems  in  the  field,  experience  indicates  that  life-cycle  costs  are  domi¬ 
nating  procurement  costs.  Testing  and  maintenance  are  the  major  sources  of  life-cycle  costs, 
as  well  as  a  limiting  factor  for  performance  due  to  system  availability.  Brute-force  applica¬ 
tion  of  test  techniques  is  now  recognized  as  generally  ineffective  and  costly,  and  a  top-down 
methodology  approach  to  the  design  and  application  of  testing  systems  with  minimal  cost 
impact  is  required. 


RELIABILITY-MEASURE  OF  TESTABILITY  CONCEPT 


1.0  Introduction  (objective).  This  project  will  formulate  a  fundamental  unit  to  measure 
the  testability  of  circuits,  devices,  subsystems,  and  systems.  This  formulation  must  be  based 
on  the  foundations  of  reliability  (Markov  models),  circuit  sensitivity,  controllability  and 
observability,  and  other  fundamental  concepts  which  are  involved  in  measuring  the  com¬ 
plexity,  time,  and  effectiveness  of  testing. 

2.0  Approach.  This  project  involves  a  complicated  analysis  of  many  different  disciplines 
and  the  program  includes  many  different  tasks. 

2.1  Measure  of  testability.  This  task  will  involve  the  determination  of  a  measure  of 
testability  based  on  review,  analysis,  and  development  involving  several  potential  concepts 
for  reaching  a  testability  measure.  Included  are: 

a.  Sensitivity  matrices.  In  all  systems  and  circuits  (analog  or  digital)  it  is  possible  to 
compute  a  sensitivity  matrix  which  can  be  related  to  certain  design  parameters  including  a 
measure  of  testability.  The  sensitivity  matrix  must  be  investigated  as  a  possible  conceptual 
foundation  that  can  become  a  basis  of  testability. 

b.  Controllability/observability  based  on  optimal  control.  This  task  will  address  a 
measure  of  testability  based  on  optimal  control  concepts  including  the  controllability,  ob¬ 
servability,  and  identifiability  principles.  This  conceptual  investigation  would  utilize  exist¬ 
ing,  proved  engineering  foundations  and  extend  them  to  the  problem  of  measuring 
testability. 

c.  Integrated  measures  of  testability.  This  task  will  address  the  task  of  developing  a 
procedure  for  giving  a  system  a  figure-of-merit  for  testability.  This  figure-of-merit  should  be 
formulated  so  that  the  contribution  of  each  subsystem  is  easily  identified  and  compared  to 
the  other  subsystems.  This  will  identify  which  subsystems  have  the  most  impact  on  overall 
system  testability. 

3.0  Scope.  This  task  will  have  impact  on  the  fault  tolerance  of  the  system  under  consid¬ 
eration.  For  this  reason,  the  results  of  this  study  should  be  in  a  form  that  can  be  useful  in 
determining  the  fault  tolerance  of  a  system. 

4.0  References.  See  standard  texts  for  the  definitions  of  controllability  and 
observability. 

5.0  Background.  A  fundamental  quantifiable  parameter  that  is  a  measure  of  testability 
of  the  system  would  allow  a  hierarchical  structure  to  be  developed.  Such  a  structure  could 
identify  the  areas  requiring  more  extensive  testing,  the  points  on  partitions  at  which  verifica¬ 
tion  could  be  difficult.  It  is  critical  to  the  entire  design  and  acquisition  process  that  a  quan¬ 
tification  of  testability  be  developed. 
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FAILURE-FAULT  PREDICTION  TECHNOLOGY 


1 .0  Introduction  (objective).  The  purpose  of  this  project  is  the  development  of  the 
techniques,  concepts,  and  methodology  for  failure-fault  predictions.  This  project  will  en¬ 
compass  both  analog  and  digital  fault  prediction.  It  will  also  address  the  system  approaches 
necessary  to  exploit  the  capability  of  prediction. 

2  0  Approach.  A  wide  range  of  tasks  is  required  in  various  areas. 


i  2.1  Identification  and  classification  of  prediction  capabilities.  This  task  will  determine 

|  which  application  circuits  and  systems  can  benefit  from  a  failure  and  fault  prediction  capa- 

|  bility.  Emphasis  will  be  on  high-frequency  analog  circuits  and  systems,  particularly  in  sen- 

j  sors  and  other  high-power  equipment. 

i 

■  2.2  Prediction  algorithms  for  analog  circuits.  The  task  must  develop  a  set  of  prediction 

|  algorithms  and  a  theoretic  foundation  for  their  application  in  achieving  fault  prediction 

>  capabilities.  It  will  involve  the  use  of  inherent  circuit  models  and  structures,  as  well  as 

I  heuristic  algorithms  to  evolve  into  practical  prediction  algorithms. 

I 

►  2.3  Feasibility  ot  fault  prediction  in  digital  circuits.  This  task  will  determine  the  feasible 

approaches  to  fault  prediction  in  digital  circuits. 

3.0  Scope.  This  project  will  explore  the  feasibility  of  fault  prediction  and  will  require  a 
research  approach  that  is  based  on  all  the  foundations  of  test  technology.  This  project  must 
include  demonstration  systems  to  show  feasibility. 

4.0  References 

•  Fault  Analysis  in  Affine  Sequential  Circuits,  Seaks,  R,  Proceedings  of  the  1976 
Conference  on  Information  Sciences  and  Systems,  p  227-232,  Johns  Hopkins  Uni¬ 
versity,  March  1976  (with  S  Sangari) 

•  An  Experimentation  in  Fault  Prediction  II.  Seaks,  R.  Proceedings  of  the  IEEE 
AUTOTESTCON,  November  1977,  Arlington,  Texas,  p  53  (abstract  only) 

■  Fault  Prediction  Towards  a  Mathematical  Theory.  Seaks,  R,  Rational  Fault 
Analysis.  New  York.  Marcel  Dekker.  Inc  (with  LTung  and  SR  Liberty,  to  appear) 

■  A  Functional  Approach  to  Fault  Analysis.  Sacks,  R.  Rational  Fault  Analysis.  New 
York.  Marcel  Dekker.  Inc  (with  MN  Ransom,  to  appear) 

■  Fault  Tolerant  Computing  An  Overview.  Avi/ienis.  A.  IEEE  Computer,  vol  4. 
p  5-N.  January  February  1971 

5.0  Background.  The  advance  of  microelectronic  technology  will  permit  the  monitoring 
and  analysis  of  circuit  and  system  parameters  for  fault  and  failure  prediction.  This  capa¬ 
bility  will  allow  tremendous  savings  in  maintenance  and  retrofit  costs  and  significantly 
improve  the  fault  tolerance  performance  of  systems. 


RECOVERY  TECHNIQUES 


1 .0  Introduction  (objective).  It  is  the  purpose  of  this  project  to  develop  approaches  for 
effective  application  of  recovery  techniques  to  fault-tolerant  systems.  The  fault-tolerant 
system  concepts  should  be  identified  and  the  philosophies,  protocol,  and  implementation 
problems  associated  with  fault  recovery  should  be  addressed.  In  addition,  costs,  tools,  and 
techniques  for  fault  tolerance  recovery  approaches  should  be  identified. 

2.0  Approach.  This  project  will  involve  various  levels  (hierarchy)  of  recovery  techniques 
from  the  component  and  device  levels  through  the  system  level. 

2. 1  Component  and  device  voting  techniques.  This  task  will  consider  the  technical 
issues  and  technology  involved  in  voting  and  other  component  fault  recovery  techniques. 

2.2  Functional  fault  recovery.  Often  faults  can  be  detected  in  functional  form  and 
recovery  is  made  to  compensate  for  this  functional  failure.  Functional  faults  are  detected  at 
the  system  level,  although  recovery  can  be  accomplished  at  the  functional  level,  subsystem 
level,  and  system  level  and  either  in  hardware  or  software  or  both.  This  task  will  develop 
the  technology  required  for  designing  fault  recovery  techniques  including  specifications, 
techniques,  and  tools. 

2.3  Software  recovery  techniques.  Software  recovery  algorithms  are  used  to  reconfigure 
at  the  subsystem  and  system  level.  This  task  will  analyze  and  develop  the  theory  of  system 
recovery  techniques  using  software.  It  will  determine  their  performance  and  cost- 
effectiveness,  implementation  problems,  and  design  tools. 

3.0  Scope.  This  project  is  considered  as  a  separate  technical  area  and  can  be  treated  that 
way.  However,  it  is  recognized  that  it  will  be  dependent  upon  several  other  projects  in  the 
theory  of  testing.  It  will  be  influenced  by  and  directly  affect  other  design  criteria  and  de¬ 
sign  procedures. 

4.0  References 

•  The  Design  of  Totally  Self-Checking  Combinational  Circuits,  Smith,  JE,  Report 
4-737,  thesis,  August  1976,  University  of  Illinois,  Urbana,  1L 

•  On  the  Existence  of  Combinational  Networks  with  Arbitrary  Multiple  Redundan¬ 
cies,  Smith,  JE,  Metze,  G.  Report  R-692.  October  1975,  University  of  Illinois, 
Urbana,  IL 

5.0  Background.  Recovery  techniques  in  fault-tolerant  systems  are  implemented  to 
compensate  for  detected  faults  and  these  techniques  are  germane  to  the  successful  design  of 
systems.  In  their  simplest  form  they  consist  of  voter  circuits;  in  the  more  complex  systems 
they  consist  of  extensive  and  complicated  software  programs.  The  theory  of  fault  recovery 
is  generally  considered  a  system  engineering  decision  process.  This  project  will  develop  the 
system  design  approaches  for  effective  application  of  recovery  techniques. 
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TRANSIENT  FAULTS 


1 .0  Introduction  (objective).  The  goal  of  this  project  is  to  develop  a  methodology  for 
use  in  systems  design  to  make  systems  fault-tolerant  to  transient  faults.  The  transient  faults 
may  he  very  short .  ie.  the  drop  of  a  bit  in  a  data  stream,  or  longer  burst  errors  in  which 
there  is  a  series  of  faults  in  a  contiguous  group.  The  methodology  should  address  both  de¬ 
tection  and  correction  of  transient  faults. 

2.0  Technical  approach.  The  following  tasks  are  critical  to  this  project. 

2.1  Definition  and  classification.  This  task  will  provide  the  foundation  for  the  research 
into  transient  faults.  It  will  define  the  different  classes  of  transients  in  terms  of  speed, 
nature  of  occurrence,  and  impact.  The  physical  nature  of  transient  faults  will  be  considered 
and  related  to  the  testing  problems  that  each  poses. 

2.2  On-line  built-in-test  lor  transients.  The  advance  of  component  and  device  technolo¬ 
gy  has  pointed  to  the  feasibility  of  monitoring  circuits  in  an  on-line  continuous  test  that  can 
detect  transient  faults.  Once  they  are  detected,  recovery  can  be  applied.  This  task  will 
analyze,  review,  and  develop  techniques  for  on-line  circuit  testing  and  concurrently  assess 
the  cost-effectiveness  of  this  approach. 

2.3  Software  recovery  for  transients.  Transients  require  recovery  techniques  identical  to 
methods  used  for  permanent  faults.  This  task  will  assess  the  effectiveness  and  practicality  of 
software  recovery  techniques  as  applied  to  transient  faults. 

3.0  Scope.  This  project  will  be  treated  separately  from  other  testing  research  because  of 
its  unique  character.  This  project  is  one  that  involves  investigating  the  underlying  physical 
phenomena  as  well  as  testing  techniques. 

4.0  References 

•  The  STAR  (Self- Testing  and  Repair)  Computer:  An  Investigation  of  the  Theory 
and  Practice  of  Fault-Tolerant  Computer  IXsign,  Avi/.ienis,  A.  Gilley,  GC,  el  al, 
IEEE  Transactions  on  Computers,  vol  C-20.  no  1  I .  p  1312-1321.  November  I  <)7 1 

5.0  Background.  Transient  faults  occurring  in  electronic  systems  are  difficult  to  cope 
with  due  to  their  elusive  nature.  Techniques  are  needed  to  detect  and  isolate  them.  This 
project  will  address  the  nature  of  transient  faults  and  the  test  techniques  which  are  applica¬ 
ble.  Transients  are  serious  if  they  are  not  part  of  a  self-compensating  system  (eg.  using  roll¬ 
back  or  built-in-test).  They  introduce  errors  randomly  and  usually  when  least  desirable: 
therefore,  they  can  be  catastrophic.  In  the  past  such  faults  were  accepted  as  undetectable  as 
well  as  impossible  to  isolate;  however,  recent  advances  in  technology  indicate  that  they  can 
be  determined  and  corrected.  In  addition,  once  detected,  they  can  be  used  to  predict  over¬ 
all  failure  of  the  device/system. 


IDENTIFICATION  OF  FAILURE  MODES 


1 .0  Introduction  (objective).  The  goal  of  this  project  is  to  identify  and  classify  failure 
modes  that  are  likely  to  occur  in  the  various  classes  of  systems. 

2.0  Approaches.  The  potential  failure  modes  for  different  types  of  circuits  will  be  iden¬ 
tified  and  categorized.  The  following  types  of  equipment  should  be  considered: 

Analog  modules 
Digital  modules 

Combination  analog  and  digital  modules 
Shipboard  equipments 
Airborne  equipments 
Shore-based  equipments 

Other  types  that  may  have  unique  features  or  characteristics 

To  gather  data  on  failure  modes  for  these  equipments,  at  least  the  following  sources 
of  information  should  be  used:  Navy  repair  depot  records,  Navy  shipboard  repair  records, 
and  Navy  shore  station  repair  records. 

After  the  basic  data  are  gathered,  they  should  be  grouped  by  type  of  failure,  fre¬ 
quency  of  failure,  cost  of  repair,  and  other  categories. 

3.0  Scope.  The  scope  of  this  project  takes  in  all  types  of  military  electronic  equipment, 
at  all  maintenance  levels. 

4.0  References 

•  Design  Validation  in  Hierarchical  Systems,  Losleban,  P,  1  2th  Design  Automations 
Conference,  p  431-438,  23-25  June  1975,  Publ  IEEE,  National  Security  Agency, 
MD 

•  A  Highly  Efficient  Redundancy  Scheme  Self-Purging  Redundancy,  Losq,  J,  IEEE 
Transactions  on  Computers,  vol  C-25,  no  6,  p  569-578,  June  1976,  Stanford 
University 

5.0  Background.  Fault-tolerant  design  that  is  based  on  statistical  failure  rates  is  one  of 
the  basic  requirements  for  a  realistic  design  and  requires  realistic  failure  rate  data.  These 
data  have  to  be  organized  into  categories  by  type  of  equipment.  Failure  mode  data  can  then 
be  used  to  establish  guidelines  for  fault-tolerant  design. 


FAULT-TOLERANT  SOFTWARE 


1 .0  Introduction  (objective).  It  is  the  purpose  of  this  project  to  develop  a  theoretical 
foundation  for  detecting  and  isolating  errors  in  software.  It  must  also  address  the  recovery 
concepts  that  can  he  applied.  This  program  should  relate  the  theoretical  foundation  to 
problems  of  system  verification,  fault  tolerance,  and  other  specific  software  design  efforts. 

2.0  Approach.  This  program  will  consist  of  the  following  sequence  of  tasks. 

2.1  Techniques  of  software  testing.  This  task  will  review,  analyze,  and  develop  tech¬ 
niques  for  testing  and  analyzing  software  to  determine  the  extent  to  which  it  performs  the 
logical  functions  intended  by  its  creator. 

2.2  Theory  of  software  testing.  This  task  will  determine  the  extent  to  which  a  theory  of 
software  testing  is  feasible  and  will  develop  the  areas  required  to  complete  the  theory. 

2.3  Program  verification.  The  need  to  verify  correctness  and  provide  for  fault  tolerance 
is  related  to  the  mathematical  proofs  that  demonstrate  the  logical  behavior  of  a  program  is 
as  specified.  It  is  also  related  to  the  constructive  approach,  which  stresses  correct  develop¬ 
ment  of  a  program.  Both  of  these  approaches  rely  on  the  programmer’s  ability  to  abstract 
properties  of  the  program.  These  are  referred  to  as  inductive  proofs  of  correctness. 

3.0  Scope.  This  project  must  be  addressed  from  the  perspective  of  the  software  engineer 
who  is  concerned  with  the  correctness  and  robustness  (fault  tolerance)  of  software.  It 
should  result  in  guideline  standards  and  a  set  of  recommendations  for  specifying,  verifying, 
and  design  of  software. 

4.0  References 

•  Toward  a  Theory  of  Test  Data  Selection,  Goodenough,  JB,  and  Gerhart.  SL, 

PROG  ACM  1975  Internationa^ Confercnceon  Reliable  Software,  ACM,  New 
York  1975,  p  493-570 

•  ACM  Computing  Surveys:  Special  Issue  on  Reliable  Software  I:  Software  Valida¬ 
tion.  vol  X,  no  3,  September  1976 

•  Probabilistic  Models  for  Software  Reliability  Prediction,  Martin  1.  Shoomar.  Assoc 
Prof  of  EL.  Poly  l  ech.  Brooklyn.  1972  International  Symposium  on  Fault- 
folerant  Computing,  Newton.  MA.  June  19-21 ,  1972 

•  F-T  Software  for  Real-Time  Applications.  H  Meet.  AERO  Space  Corp.  El  Segtm- 
do.  (  A.  ACM  Computing  Surveys.  Special  Issue.  Reliable  Software  IE  Fault- 
Tolerant  Software,  vol  X.  noj4.  December  1976.  p  391-407 

•  Design  of  Self-Checking  Software.  SS  YAU  and  RC  Cheung,  II. EE  Proceedings  of 
1975  International  Conference  on  Reliable  Software 

•  System  Structure  for  Software  Fault  Tolerance.  Randell,  B,  Sigplan  Not.  vol  10. 
no  6.  p  437-439.  June  1975.  University  of  Newcastle  Upon  Tyre.  England 

•  Software  Design  Validation  Tool.  Carpenter.  LI),  Tripp,  IT.  Sigplan  Not.  vol  10. 
no  (>.  p  395-400,  June  I  975.  Boeing  Computer  Services.  Inc.  Seattle.  W  A 
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•  Fault-Tolerant  Software  for  Spaceeraft  Applications,  Hecht,  H,  Final  Report,  10 
December  1975,  Aerospaee  Corporation,  El  Segundo,  CA,  eontraet  F04701-75- 
C-0076 

5.0  Background.  Reliable  software  implies  robustness  (fault  tolerance).  That  is,  it  is 
expected  to  deliver  a  certain  minimum  level  of  services  even  when  faced  with  an  unexpected 
or  hostile  environment  (such  as  hardware  failure  or  bad  data).  Techniques  for  detection  and 
isolation  of  faults  are  classified  into  testing  (analysis  of  responses  of  a  selected  set  of  input 
data)  and  verification  (mathematical  proof  and  constructive  approach). 


HIGH-ORDER  LANGUAGE  CONSTRUCTS  FOR  FAULT-TOLERANT  SYSTEMS 


1 .0  Introduction  (objective).  The  purpose  of  this  project  is  the  analysis  and  develop¬ 
ment  of  fault-tolerant  language  constructs  suited  for  inclusion  in  the  DoD  high-order  lan¬ 
guage  standard. 

This  project  will  address  language  constructs  that  allow  incorporation  of  program 
redundancy  into  a  block-structured  program  in  a  well  structured  manner.  It  will  not  address 
system  architecture  and  design  methodologies  for  fault-tolerant  programs. 

2.0  Approach.  The  project  will  be  conducted  in  the  following  sequence  of  tasks. 

2.1  Analyze  fault-tolerant  program  constructs.  This  task  will  collect,  review,  and  ana¬ 
lyze  all  constructs  required  for  fault-tolerant  programs.  It  will  provide  a  list  that  can  be  in¬ 
corporated  in  the  DoD  high-order  language. 

2.2  RMS-2  fault-tolerant  constructs.  The  list  of  constructs  will  be  reviewed  for  possible 
incorporation  in  the  Navy’s  CMS- 2  tactical  language. 

2.3  DoD  high-order  fault-tolerant  constructs.  The  list  of  fault-tolerant  program  con¬ 
structs  will  be  reviewed  for  possible  incorporation  in  extensible  DoD  high-order  language 
standards.  It  will  determine  guidelines  and  recommendations  for  full  use  of  fault-tolerant 
programs. 

3.0  Scope.  This  project  must  be  considered  in  relation  to  other  research  relative  to 
fault-tolerant  programs;  namely,  systems  architecture  and  systems  design  methodology. 
However,  this  work  can  be  performed  concurrently  with  other  efforts. 

4.0  References 

•  Recent  Developments  in  Software  Fault-Tolerance  Through  Program  Redundan¬ 
cy,  Kim,  KH.  and  Ramamoorthy,  CC,  Proc  of  the  10th  Hawaii  International 
Conference  on  System  Science,  January  6-7,  1977.  p  234-239 

5.0  Background.  The  most  effective  methods  of  meeting  complex  availability  require¬ 
ments  and  reducing  development  and  support  costs  are  to  develop  methodologies  and  tools 
for  systematic  incorporation  of  redundancy  programming  in  a  well  structured  form.  Good 
structuring  of  a  fault-tolerant  program  is  particularly  important,  since  the  use  of  program 
redundancy  increases  the  program  size.  Fault-tolerant  programs  have  generally  higher  devel¬ 
opment  costs  and  higher  processing  cost.  The  high  processing  cost  of  fault-tolerant  pro¬ 
grams  is  due  to  increased  processor  time  and  increased  storage  space  required  for  executing 
redundant  program  components. 
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ANALOG  FUNCTIONAL  REDUNDANCY 


1 .0  Introduction  (objective).  This  project  will  address  several  approaches  that  will  result 
in  the  use  of  functional  redundancy  in  analog  circuits.  This  project  must  achieve  a  structur¬ 
ing  of  analog  fault  tolerance  test  techniques  and  capabilities.  It  must  identify  those  charac¬ 
teristics  of  analog  circuits  which  can  be  exploited  in  systems.  It  must  demonstrate,  through 
examples,  the  capabilities  that  are  possible  by  quantitative  analysis  and  test  evaluation. 
Analog  circuits  span  the  entire  known  frequency  range,  are  both  high  and  lower  power,  and 
represent  a  complexity  of  functions  that  rivals  the  ability  of  digital  circuits. 

2.0  Approach.  This  project  will  consist  of  several  concurrent  tasks,  with  each  contribut¬ 
ing  to  the  overall  understanding  and  development  of  analog  functional  redundancy  tech¬ 
niques.  These  are  as  follows. 

2.1  Classification  of  analog  functions.  The  wide  range  of  analog  circuits  and  functions 
will  be  classified  into  structures  that  can  be  a  basis  for  applying  various  redundancy  con¬ 
cepts.  These  classifications  may  be  frequency,  power,  size,  and  other  characteristics.  The 
task  should  be  conducted  from  the  perspective  of  the  designer  who  must  use  fault  tolerance 
to  achieve  design  objectives. 

2.2  Redundant  sensors.  Analog  sensors  have  been  a  continuous  source  of  failures.  Tech¬ 
niques  should  be  analyzed,  developed,  and  documented  to  permit  more  reliable  sensing  sys¬ 
tems.  Included  in  this  task  are  phased  array  antennas  and  sonar  transducers  as  well  as  other 
pressure,  temperature,  and  environmental  sensors. 

2.3  Redundant  filters  techniques.  Among  the  most  commonly  used  analog  functions  is 
signal  filtering.  The  task  will  address  the  use  of  redundancy  in  filter  arrays.  This  problem  is 
particularly  critical  in  radar,  EW,  and  sonar  signal  processing  applications. 

2.4  Analog  control  reliability.  The  control  (analog  or  digital)  of  analog  circuits  is  critical 
to  complex  array  processing,  and  this  task  will  address  the  area  of  redundant  and  built-in 
test  in  control.  Such  a  task  will  be  critical  to  redundant/voter  circuit  design.  It  will  address 
the  apparent  vs  the  real  need  for  a  highly  reliable  voter  circuit. 

3.0  Scope.  This  project  must  draw  upon  existing  and  planned  systems  design,  for  exam¬ 
ple,  and  future  analog  functions.  This  project  must  involve  experts  in  signal  processing, 
systems  engineering,  and  component  and  device  technology. 

4.0  References 

•  Research  on  Fault  Analysis  of  Analog  Circuits,  Bedrosian,  SD.  Ho,  DeYuan,  inter¬ 
im  report.  Office  of  Naval  Research.  Alexandria.  VA.  August  1976,  contract 
N00014-75-C-0768,  University  of  Pennsylvania 

5.0  Background.  Analog  circuits  provide  a  unique  challenge  to  fault-tolerant  design 
since  they  are  particularly  complex  and  far  more  prone  to  failure  than  digital  circuits.  Ana¬ 
log  circuits  often  require  close  component  tolerances  that  tend  to  slowly  degrade.  This  slow 
degradation  presents  a  special  challenge  to  the  designer  of  redundant  circuits. 
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FUNCTIONAL  TEST  DESIGN  THEORY 


1 .0  Introduction  (objective).  Functional  test  is  an  area  in  which  little  theory  of  testing 
has  been  developed.  This  project  will  develop  the  theory  of  functional  testing,  from  its  theo¬ 
retical  foundations  to  its  principles  of  application  in  systems,  software,  and  hardware.  This 
program  can  become  the  basis  of  functional  testing  of  microcomputers  as  well  as  analog 
functional  testing. 

2.0  Approach.  The  project  involves  the  following  sequence  of  tasks. 

2.1  Theory  of  functional  test.  The  basic  mathematical  engineering  foundation  of  func¬ 
tional  test  should  be  developed  in  a  manner  similar  to  that  of  software  module  definition. 

This  task  should  address  definition  in  terms  of  completeness  and  consistency.  These  con¬ 
cepts  should  produce  an  approach  toward  performing  functional  tests  that  are  related  to 
technology,  although,  in  some  isolated  cases,  the  approach  may  be  independent  of  technology. 

2.2  Analogy  functional  test.  Concepts  for  analog  functional  test  should  be  evaluated, 
analyzed,  developed,  and  documented.  This  task  should  be  an  extension  of  the  above  task. 

2.3  Digital  (microcomputer)  functional  test.  The  technique  to  functionally  test  a  micro¬ 
computer  or  equivalent  digital  logic  is  the  primary  objective  of  this  task.  Based  on  task  2.1 
above,  this  task  should  demonstrate  and  prove  the  applicability  of  functional  test  to  micro¬ 
computers. 

3.0  Scope.  This  project  should  be  related  to  the  next  higher  (and  lower)  level  of  the 
test  hierarchy.  This  relationship  is  critical  to  the  successful  development  of  a  fault-tolerant 
systems  capability. 

4.0  References 

•  A  Graph  Model  for  Fault-Tolerant  Computing  Systems,  Hayes,  JP,  IEEE  Trans 
Comput,  vol  C-25,  no  9,  September  1976 

•  A  Reliability  Model  for  Various  Switch  Designs  in  Hybrid  Redundancy,  Ingle,  A, 
Siewiorek.  DP,  IEEE  Transaction  Computers,  vol  C-25,  no  2,  pp  1 1 5-1 33, 

February  1976,  Carnegie-Mellon  University 

5.0  Background.  Functional  testing  has  often  been  left  out  of  a  system  design  until  the 
last  stages.  This  leads  to  testing  as  an  add-on,  rather  than  a  built-in,  function.  If  a  good 
applied  theory  of  testing  is  developed,  functional  testing  can  become  part  of  system  design 
from  the  beginning.  In  addition,  functional  testing  can  become  the  control  portion  of  a  fault- 
tolerant  system  design. 


V  V  -  -  •  •  V  *  • 


FAULT  TOLERANCE  VALIDATION  AND  VERIFICATION 


1.0  Introduction  (objective).  The  purpose  of  this  project  is  the  development  of  a  sound 
approach  for  the  validation  and  verification  of  fault  tolerance  systems  capability.  This  project 
will  address  the  validation  and  verification  problem  from  the  perspective  of  the  systems  acqui¬ 
sition  manager,  beginning  with  operational  requirements  and  proceeding  through  systems 
acceptance  test  and  evaluation.  This  project  must  identify  techniques,  tools,  and  standards 
that  will  simplify  the  verification  process  and  improve  its  effectiveness.  It  must  consider  all 
concepts  within  the  context  of  coping  with  the  increasing  complexity  of  advanced  systems 
due  to  new  component  and  device  technology. 

2.0  Approach.  There  are  several  important  concepts  that  can  be  integrated  into  a  sound 
approach  to  validation  and  verification.  Each  concept  addresses  different  steps  in  the  design 
and  acquisition  process  and  can  be  combined  into  an  integrated  approach. 

2.1  Verification  of  specifications/standards.  The  operational  requirements  must  be  trans¬ 
lated  into  specifications  so  that  the  requirement  can  be  traced  into  the  specification.  In 
addition,  specifications  that  guarantee  that  a  requirement  is  satisfied  must  be  identified.  The 
analysis  tools  that  permit  verification  of  specifications  must  be  identified  and,  if  necessary, 
developed.  It  is  possible  that  the  risk,  reliability,  and  systems  analysis  tools  for  system  con¬ 
cept  formulation  are  similar,  if  not  identical,  to  the  tools  necessary  for  validation  and 
verification. 

2.2  System  concept  formulation  verification  and  validation.  The  formulation  of  system 


concepts  is  critical  to  life-cycle  costs.  In  particular,  system  concept  will  affect  about  70%  of 
the  life-cycle  costs,  and  it  is  necessary  to  select  the  best  concept  in  terms  of  cost.  This  task 
will  determine  the  first  stages  of  system  concept  formulation  techniques. 

2.3  Design  for  validation  and  verification.  Design  of  a  system  should  provide  a  structure 
that  can  lead  to  simplification  of  the  validation  and  verification  process.  Functional  and 
mechanical  partitioning  must  be  performed  to  permit  testing  of  various  components  and  func¬ 
tions  independently;  then  an  approach  must  be  provided  for  verification  of  integrated  sys¬ 
tems.  This  task  will  address  fault-tolerant  design  rules  that  will  lead  to  design  of  verifiable 
systems. 

2.4  System  acceptance  test  theory.  This  task  will  develop  the  theory  of  performing  system 
acceptance  test  and  evaluation. 

3.0  Scope.  The  perspective  of  the  verification  task  is  one  which  addresses  the  constructive 
techniques  and  further  recognizes  that  many  other  areas/projects  have  immediate  and  direct 
relationship  to  verification.  The  project  must  be  conducted  by  personnel  with  a  combination 
of  skill  and  experience  in  various  disciplines. 

4.0  References 

•  An  Artwork  Design  Verification  System.  Baird.  US,  Clio.  YE,  12th  Design  Auto¬ 
mations  Conf,  p  414-420.  23-25  June  1 V) 75 ,  RCA  l  aboratories 
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5.0  Background.  Without  verification  procedures  it  will  be  difficult,  if  not  impossible, 
for  the  government  to  put  a  fault  tolerance  requirement  into  a  system  specification.  There 
must  be  a  clearly  defined  procedure  for  the  government  to  use  in  buying  a  system  from  a 
contractor:  otherwise  the  fault-tolerant  specifications  can  only  be  design  goals  and  not  design 
requirements. 


ESTIMATION  OF  CONFIDENCE  LIMITS  TESTING  LARGE  LOGIC  NETWORKS 


1 .0  Introduction  (objective).  The  purpose  of  this  project  is  to  provide  a  methodology 
and  means  for  accurately  determining  confidence  limits  for  the  reliability  of  large  digital 
logic  networks  without  exhaustively  exercising  all  possible  input  sequences  or  simulating  all 
logical  faults  in  the  network.  This  will  involve  development,  implementation,  and  demon¬ 
stration  of  a  logic  simulation  and  fault  analysis  system  that  will  serve  as  a  tool  for  determina¬ 
tion  of  reliability  confidence  limits  of  digital  networks  consisting  of  up  to  15  000  elements 
by  simulating  selected  combinations  of  faults  against  specific  sequences  of  operations  and, 
from  the  absence  of  these  faults,  inferring  the  correctness  of  the  entire  system. 

Faults  will  be  modeled  via  probabilistic  techniques,  and  will  occur  singly  and  in  mul¬ 
tiple.  The  system,  when  completed,  should  be  capable  of  reasonably  approximating  the 
actual  operation  of  an  existing  logic  network,  including  occurrence  and  location  of  failures, 
when  all  necessary  processing  and  operating  parameters  are  input  to  the  model. 

2.0  Approach.  This  project  will  be  accomplished  in  a  series  of  several  interrelated  tasks. 

2.1  Simulation  model.  This  task  will  include  devising  a  simulation  model  for  large  logic 
blocks  that  is  economical  in  terms  of  host  computer  storage  and  speed  of  simulation  yet  is 
sufficiently  adaptable  that  a  number  of  different,  simultaneous  failure  mechanisms  can  be 
accommodated. 

2.2  Failure  injection  routines.  This  task  will  create  a  set  of  probabilistic  failure  injection 
routines  for  use  with  the  simulator  developed  in  2.1  above.  The  routines  will  be  based  on 
logic  operation  and  observed  anomalies. 

2.3  Combining  failure  routines  with  simulator.  This  task  will  consist  of  combining  the 
simulation  and  the  failure  injection  routines  developed  in  2.2  above,  then  adjusting  them  to 
correctly  predict  the  operation  of  logic  modules  with  simple  (one  or  two  parameters)  defect 
mechanisms. 

2.4  Confidence  limits.  This  task  will  include  both  (a)  relating  confidence  limits  to  non- 
exhaustive  measures  of  correctness  of  logic  operation,  and  (b)  devising  methods  of  measuring 
the  thoroughness  of  tests  and  assigning  confidence  limits  to  these. 

2.5  Small-scale  demonstration.  This  task  will  combine  tasks  2.1  through  2.4  into  a  small- 
scale  demonstration  on  a  general-purpose  computer  and  extend  these  techniques  for  larger 
logic  blocks  and  more  simultaneous  defect  mechanisms. 

2.6  Techniques  for  larger  logic  blocks.  This  task  will  extend  the  above  tasks  into  larger 
logic  blocks  and  more  simultaneous  defect  mechanisms.  Included  in  this  task  is  the  concep¬ 
tual  design  of  a  system  to  perform  prediction  of  confidence  limits  for  up  to  15  000-gate 
networks,  including  data  base  and  processing  requirements. 

2.7  Validation.  This  task  will  validate  the  system  through  comparison  with  actual  tests 
of  well  characterized,  known  defective  logic  networks,  and  then  a  demonstration  of  a  small- 
scale  digital  network  on  the  system  will  be  performed. 
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3.0  Scope.  This  project  will  involve  logic  designers  that  have  worked  with  a  number  of 
different  circuit  card  testers. 

4.0  References 

•  Reliability  Modeling  of  NMR  Networks.  Abraham.  JA,  Siewiorek.  DP.  June  1974, 
avail  NTIS,  Springfield.  VA 

*  A  Logic  System  for  Fault  lest  Generator.  Akers,  SB,  Jr,  IEEE  Transactions  on 
Computers,  vol  C-25,  no  h.  p  h20-<>30.  June  1 97b.  GE  Electronics  Laboratory, 
Syracuse,  NY 

5.0  Background.  A  number  ot  log!  ■  ard  testers  have  been  built  that  may  be  adapted  to 
testing  fault-tolerant  digital  cards.  Mow  the  various  systems  can  be  adapted  and  used  will  be 
of  great  importance  for  maintaining  digital  equipment. 


COMMAND  CONTROL  FAULT  TOLERANCE 


1 .0  Introduction  (objective).  The  goal  of  this  project  is  to  develop  a  design  and  evalua¬ 
tion  methodology  for  fault-tolerant  command  control  global  networks.  The  methodology 
should  address  transient  faults,  permanent  faults,  and  spoofing  faults  in  relation  to  the  sys¬ 
tem  as  an  entity  and  in  relation  to  each  of  its  major  components  ADP  systems,  communi¬ 
cation  systems  (inter-  and  intra-platform),  sensor  systems,  weapons  systems,  and  human 
systems. 

2.0  Approach.  The  project  should  first  address  each  class  of  faults  separately,  then 
address  the  problem  of  the  design  and  evaluation  of  systems  that  encompass  all  the  classes  of 
faults.  Specifically,  the  tasks  are  to  identify  and  then  investigate  the  impact  on  command 
control  system  components  and  the  total  system  of 

Transient  far'ts  both  long  term  and  short  term, 

Permanent  faults  single  and  multiple. 

Spoofing  faults  detection  and  countermeasures,  and 
Combinations  of  the  above  classes  of  faults 

and  then  to  develop  a  methodology  to  evaluate  fault-tolerant  C^  systems. 

The  particular  parameters  which  need  to  be  addressed  in  relation  to  the  different 
classes  of  faults  are 

Survivability 

Responsiveness 

Availability 

Flexibility 

Invulnerability 

Interoperability 

Usability 

Capacity 

3.0  Scope.  This  project  will  be  concerned  with  improving  the  performance  of  the  mili¬ 
tary’s  command  control  and  communications  systems  (C^).  An  overall  understanding  of  the 
C^  will  be  needed  by  personnel  assigned  to  this  project. 

4.0  References 

•  Navy  Command  Control  and  Communication  System  System  Concept,  Naval 
Warfare  Effectiveness  Group,  NOSC,  San  Diego,  California,  12  July  1976 

•  Navy  Command  Control  and  Communications  System  Definition  of  Performance 
Measures,  Naval  Warfare  Effectiveness  Group,  NOSC,  San  Diego,  California, 

1  January  1977 

5.0  Background.  The  availability  of  the  military  C^  systems  on  a  global  basis  is  of  major 
importance,  especially  during  the  periods  preceding  and  during  a  conflict.  A  fault-tolerant 
design  could  contribute  to  the  availability  of  a  system  during  critical  periods. 
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Because  of  the  many  sensors  now  available,  the  sophisticated  data  processing,  and 
man-machine  interfaces  required,  it  is  critical  that  fast  communications  be  available  within  a 
platform,  among  platforms,  and  on  a  global  basis  to  give  a  commander  critical  information 
when  he  needs  it.  This  is  true  during  the  entire  conflict  scale  from  a  prehostilities  crisis  to 
global  war  because  data  acquired  from  local  sensors  as  well  as  from  those  operated  by  shore- 
based  commanders  in  support  of  task  forces  will  be  correlated  and  evaluated  in  near  real  time 
as  inputs  to  a  commander’s  decision-making  process.  Timeliness  and  accuracy  are  paramount 
to  this  process. 

Because  of  the  long  range  of  modern  weapons  a  small  conflict  can  have  worldwide 
impact.  Therefore,  it  is  important  that  a  local  commander  have  highly  reliable  and  flexible 
communications  with  higher  authorities  and  also,  in  the  other  direction,  communications 
with  those  under  him.  A  missile  fired  by  mistake,  or  at  the  wrong  target,  could  be  a  disaster. 


ARCHITECTURES  FOR  AVAILABILITY  REQUIREMENTS 


1 .0  Introduction  (objective).  The  purpose  of  this  project  is  the  review,  analysis,  and 
development  of  system  architecture  and  networks  which  will  result  in  increased  fault-tolerant 
systems  capabilities.  Through  this  project,  system  architecture  tradeoffs  will  be  established 
and  applied,  to  determine  practical  system  structures  meeting  all  requirements.  This  project 
must  address  the  selection  of  architectures  with  regard  to  many  different  criteria,  including 
cost,  timing  (queuing),  complexity,  maintainability,  and  automatic  recovery. 

2.0  Approach.  This  project  will  have  the  following  tasks. 


2.1  Analysis  of  classical  architectures.  This  task  will  evaluate  existing  distributed  archi¬ 
tectures  such  as  the  STAR  and  the  RING,  and  other  classical  multiprocessor  configurations 
will  be  analyzed  for  their  ability  to  operate  in  a  fault-tolerant  condition.  This  task  will  pro¬ 
vide  detailed  analysis,  simulation,  or  other  data  that  will  permit  quantification  of  the  results. 

2.2  Distributed  nets  of  computers.  This  task  will  result  in  the  formulation  of  an  approach 
to  permit  quantitative  evaluation  of  networks.  Current  techniques  have  not  addressed  this 
issue,  and  extensive  development  of  the  foundations  of  fault-tolerant  evaluation  is  necessary. 
This  task  combines  the  use  of  reliability  models  with  queuing  models  (or  other  performance 
analysis  tools). 

2.3  Signal  processing  and  analog  system  architectures.  The  analog  and  signal  processing 
architectures  requiring  extensive  design  in  fault  tolerance  capabilities  will  be  investigated  by 
this  project.  Distributed  sensors  such  as  phased  array  antennas,  transducers,  and  banks  of 
filters  will  be  reviewed  for  fault  tolerance  design  capabilities.  A  detailed  analysis  will  be 
provided. 

3.0  Scope.  This  project  must  be  coordinated  with  other  important  research  areas  includ¬ 
ing  microcomputers,  fault  prediction,  and  the  taxonomy  of  fault  tolerance  and  maintainability 

4.0  References 


•  Theory  of  Fault  Tolerance  1976  Annual  Report,  vol  1  and  11.  contract  N00014-75- 
C-001 1 ,  Office  of  Naval  Research,  December  7,  1976 

•  Fault-Tolerant  Computing:  An  Introduction,  Avizienis,  A,  National  Science  Foun¬ 
dation,  Grant  no  MSC  72-03633  A4,  Computer  Science  Dept,  UCLA,  Los  Angeles, 
CA  90024,  January  22.  1977 

5.0  Background.  Current  efforts  select  structures  which  are  intended  to  meet  perform¬ 
ance  bounds  and  only  later  is  it  discovered  that  the  fault  tolerance  requirements  cannot  be 
met. 

The  importance  of  this  project  is  reflected  in  the  fact  that  system  architectures  for 
meeting  performance  designs  are  in  conflict  with  those  for  fault-tolerant  design  (eg.  perform¬ 
ance  design  dictates  single  centralized  processing,  while  fault  tolerance  design  requires  dis¬ 
tributed  processing  and  multichannel  communications). 
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COMMUNICATION  PROTOCOL  FOR  FAULT  TOLERANCE 


1 .0  Introduction  (objective).  Bus  protocol  for  the  reconfigurability  of  bus  structured 
system  architecture  is  critical  to  simple  and  cost-effective  system  design.  This  project 
addresses  the  various  requirements  and  specifications  that  lead  to  standardization  of  bus 
protocol.  It  will  include,  to  a  lesser  extent,  the  feasibility  of  technology-independent  input- 
output  (I/O)  interfaces.  The  bus  protocol  will  establish  which  units  are  master  and  slaves, 
which  unit  will  transmit  and/or  receive.  It  will  address  the  set  of  commands  and  acknowledg¬ 
ments  that  are  necessary  to  determine  which  unit  has  failed,  where  the  unit  is,  and  what 
recovery  procedures  are  to  be  taken.  This  project  is  critical  to  future  distributed  fault- 
tolerant  systems  design. 

2.0  Approach.  This  project  will  review,  analyze,  and  catalog  the  characteristics  of  various 
existing  and  proposed  bus  protocols  and  interface  techniques  to  determine  their  applicability 
to  fault-tolerant  system  design.  It  will  ultimately  provide  recommendations  for  selection  of 
these  techniques  as  military  standards. 

3.0  Scope.  The  project  must  be  conducted  to  achieve  simple  design  of  distributed  com¬ 
puter  systems.  It  is  a  project  that  will  be  critical  to  exploiting  microcomputers  in  large  net¬ 
works.  The  project  must  be  conducted  in  a  way  to  maximize  the  military’s  use  of  new 
technology  without  introducing  additional  life-cycle  costs  in  the  support  of  nonstandard 
interfaces. 

4.0  References 

•  MIL-STD-1 553 

•  SDMS 

•  ANSI 

5.0  Background.  The  area  of  bus  protocol  —  I/O  interface  has  been  one  of  continued 
concern,  as  evidenced  by  the  references. 
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SPECIFICATION  FOR  REDUNDANCY  MANAGEMENT 


1 .0  Introduction  (objective).  It  is  the  purpose  of  this  project  to  develop  a  common 
approach  to  fault  tolerance  system  design  through  redundancy  management.  It  will  encom¬ 
pass  the  exchange  of  information  within  industry  and  the  R&D  community  and  develop 
state-of-the-art  requirements  for  redundancy  management. 

2.0  Approach.  The  goal  of  this  project  is  to  develop  a  common  approach  to  redundancy 
methods  for  system  design.  The  project  will  consist  of  the  following  tasks. 

2. 1  Analyze  fault-tolerant  redundancy  design  techniques.  Collect,  review,  and  analyze 
various  fault-tolerant  system  design  techniques  used  in  government  and  industry  today. 

2.2  Develop  redundancy  management  techniques.  Assemble  the  data  developed  in  2.1 
above  into  redundancy  techniques  for  use  by  system  designers  as  a  guide  in  developing  a 
common  approach  to  system  design. 

3.0  Scope.  This  project  must  be  considered  in  relation  to  other  research  relative  to  fault- 
tolerant  programs;  namely,  systems  architecture  and  systems  design  methodology.  This 
project  can  be  performed  concurrently  to  the  other  efforts. 

4.0  References 

*  Fault-Tolerant  Computing:  An  Introduction,  Avizienis,  A,  UCLA,  Los  Angeles, 
January  22,  1977.  Work  supported  by  NSF  Grant  no  MCS  72-03633  A04 

*  Computer  Redundancy:  Design,  Performance,  and  Future,  RE  Kuehn,  IEE 
Transactions  on  Reliability,  vol  R-18,  no  1,  February  1969,  p  3-1  1 

5.0  Background.  Industry  is  currently  developing  and  using  systems  that  exploit  the 
theory  of  redundancy.  Although  redundancy  is  used  in  many  areas  to  develop  more  “relia¬ 
ble”  systems,  its  use  is  usually  an  individual  decision.  The  general  lack  of  technical  sources 
and  a  common  exchange  of  knowledge  in  the  application  of  redundancy  for  system  design, 
construction,  and  testing  of  fault-tolerant  systems  clearly  indicates  a  need  for  development 
of  specifications  and  standards  addressing  redundancy  management. 
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STANDARDS  AND  FAULT  TOLERANCE 


l  .0  Introduction  (objective).  The  goal  of  this  project  is  to  make  recommendations  for 
modifying  existing  military  specifications  to  include  fault-tolerant  design.  The  specifications 
to  be  addressed  are  the  Standard  Electronic  Module  (SEM),  DoD-1  high-order  languages, 
built-in-test  circuits,  built-in-test  design  handbook,  MIL-STD-1329  (Test  Points  and  Interface 
Selection),  and  others. 

2.0  Approach.  This  project  is  directed  toward  functions  and  control  of  standardization 
through  the  following  tasks. 

2.1  Definitions.  This  task  reviews  and  updates  all  definitions,  terminology,  and  other 
symbols  that  reflect  the  requirements,  specification,  and  test  of  fault-tolerant  systems.  It  will 
consolidate  definitions,  for  both  military  and  other  organizations,  into  a  useful  form  for  use 
by  the  military. 

2.2  Standard  specifications.  As  fault  tolerance  technology  matures,  a  set  of  standard 
specifications  will  be  evolved  and  then  applied.  This  task  will  lay  the  groundwork  for  this 
standard  to  be  developed  and  applied. 

2.3  Standards  -  software  and  hardware.  Standard  Electronic  (AVIONICS)  Modules 
(SEMs)  and  Standard  Software  Modules  (SSMs)  are  fundamental  to  the  simplification  of 
design  and  maintenance  and  are  critical  to  the  reduction  of  life-cycle  costs.  This  task  will 
review  current  and  proposed  efforts  and  will  indicate  developments  which  will  foster  the 
advance  of  fault-tolerant  technology. 

2.4  Test  and  evaluation.  Fault  tolerance  acceptance  testing  is  important  to  the  life-cycle 
cost  and  performance  effectiveness  of  military  systems;  this  task  will  address  the  practicality 
of  standards  in  testing. 

2.5  Other  standards.  Operational  requirements  through  evaluation  can  benefit  from  the 
use  of  standards.  This  task  will  explore  the  practicality  of  their  use  in  the  design  and  acquisi¬ 
tion  cycle. 

3.0  Scope.  This  project  will  require  interaction  with  all  other  tasks  to  determine  the 
characteristics  and  effectiveness  of  standards.  This  project  should  be  performed  in-house  to 
obtain  insight  into  methods  for  applying  standards. 

4.0  References 

•  Standardization,  Verman.  Lai  C,  Archor  Books,  1973 


5.0  Background.  The  use  of  standards  is  necessary  for  simplified,  cost-effective  applica¬ 
tion  of  fault  tolerance  in  military  systems.  Among  the  standards  to  be  reviewed,  developed, 
and  improved,  relative  to  fault  tolerance,  are  Standard  Electronic  Modules  (SEMs). 
Requirements/Specification  Standards,  DoD-1  higher-order  language,  built-in-test  circuits, 
built-in-test  design  handbooks,  and  MIL-STD-1329  (Test  points  and  Interface  Selection). 
This  task  will  be  instrumental  in  obtaining  fault-tolerant  capability  in  all  military  systems. 


SELF-DIAGNOSING  DESIGN  TECHNIQUES 


1.0  Introduction  (objective).  This  project  will  study  and  analyze  architecture,  functional 
partitioning,  and  module  and  component  design  features  necessary  for  achieving  self-diagnostic 
microprogrammable  capabilities  in  processors.  The  project  will  result  in  a  set  of  guidelines 
for  designing  LSI  components  and  modules  that  ensure  testability  of  the  component  and  the 
system  employing  the  component.  Redundancy  levels,  organization  of  functions,  require¬ 
ments  for  test  points  to  be  made  accessible,  and  procedures  for  using  these  data  to  design  and 
implement  a  self-diagnosable  processor  to  the  replaceable  module  level  will  be  postulated  and 
verified.  The  project  will  result  in  definition  of  general  specifications  for  design  and  utiliza¬ 
tion  of  LSI  components  and  modules  in  self-diagnosing  systems. 

2.0  Approach.  Several  tasks  are  required  in  a  variety  of  areas. 


2.1  Definition  of  baseline  processing  requirements.  A  set  of  processing  requirements  will 
be  selected  as  a  baseline  for  the  comparative  design  of  the  self-diagnostic  processors.  Because 
the  thrust  of  this  project  is  demonstration  of  the  self-diagnostic  properties,  an  extensive  set 
of  baseline  requirements  is  not  necessary.  However,  requirements  must  be  sufficient  to  vali¬ 
date  the  demonstration  with  respect  to  airborne  processors  rather  than  with  respect  to  a 
trivial  set  of  requirements. 

2.2  Comparison  of  competing  approaches.  A  comparison  of  self-diagnostic  design  tech¬ 
niques  applied  to  both  processors  constructed  of  bit-slice  components  and  processors  con¬ 
structed  of  monolithic  components  will  be  made  to  determine  the  proper  level  of  partitioning 
for  self-diagnosability.  While  it  may  be  premature  to  estimate  production  costs  of  the 
processors,  those  characteristics  that  are  known  to  contribute  to  cost  will  be  examined.  The 
comparison  will  consider  the  following  characteristics  at  a  minimum: 

Simplicity  of  design 
Parts  count  (variety  of  parts) 

Component  count 

Availability  of  components 

Performance 

Programmability 

Reliability 

Testability 

The  evaluation  will  result  in  a  top-level  design  of  a  processor  having  the  desired  self¬ 
diagnostic  capabilities  and  meeting  the  baseline  requirements  for  each  of  the  competing 
approaches.  At  the  end  of  this  task  the  competing  designs  and  results  of  the  evaluation  will 
be  reviewed  and  a  selection  made  of  the  approach  to  be  pursued  during  the  remainder  of  the 
project. 


2.3  Processor  design.  The  design  approach  selected  in  2.2  above  will  be  extended,  during 
this  task,  to  include: 


Selection  of  components 


Layout  of  processor 

Specification  of  components 

Generation  of  logic  diagrams 

Selection  of  instruction  set 

Writing  of  microcode 

Simulation  of  processor 

Evaluation  of  self-checking  coverage 

Identification  of  risk  areas  requiring  verification 

A  design  review  will  be  held  at  the  conclusion  to  evaluate  the  design.  The  high-risk 
areas,  to  be  further  refined,  will  be  selected  at  this  design  review. 

2.4  Processor  simulation.  The  self-diagnosing  processor  will  be  simulated  in  order  to 
demonstrate  that  the  design  will  perform  as  required  and  to  resolve  conflicts  and  uncertain¬ 
ties  in  the  design.  Development  of  a  detailed  instruction  level  simulation  is  not  intended  as 
part  of  this  task,  but  a  top-level  event  simulation  capable  of  identifying  conflicts  and  races  at 
the  register  level  will  be  required.  Some  detailed  simulation  of  portions  of  the  processor  may 
also  be  required. 

2.5  Breadboarding  of  high-risk  circuits.  Implementation  of  a  self-diagnosing  processor 
may  require  unproved  circuits  and  circuit  technologies.  In  an  attempt  to  reduce  the  risk 
associated  with  these  circuits,  breadboarding  may  be  desired.  Those  devices  and  circuits  that 
are  estimated  to  be  of  high  risk  will  be  identified  in  2.2  above  and  presented  during  the 
design  review. 

3.0  Scope.  This  project  will  explore  the  feasibility  of  developing  a  cost-effective  and 
practical  approach  to  designing  a  self-diagnosing  fault-tolerant  computer.  This  will  be  accom¬ 
plished  through  the  study,  analysis,  and  simulation  of  techniques  and  components  that  can 
be  used  in  digital  systems  that  are  self-diagnosable  to  the  replaceable  module  level,  resulting 
in  a  set  of  design  specifications  for  module  design  that  ensures  self-diagnosability. 

4.0  References 

•  US  Air  Force.  Air  Force  Systems  Command,  Aeronautical  Systems  Division/PPM 
FA  Wright -Patterson  AFB,  Ohio  45433,  Solicitation  F  3361  5-77-R-l  106, 

25  August  1976 

•  Sy  nthesis  and  Analysis  of  a  Cost  Effective,  Ultra-Reliable.  High  Speed.  Semicon¬ 
ductor  Memory  System.  FW  Husband  and  SA  Szygenda,  IEEE  Transactions  on 
Reliability,  vol  R-25.  no  3.  August  1976.  p  217-223 

■  System  Fault  Diagnosis  Masking,  Exposure  and  Diagnosability  Without  Repair. 
Russel.  JD.  IEEE  Transaction  Computer,  vol  C-24,  no  12,  p  1  145-1  155.  December 
1975.  University  of  Wisconsin 

•  Managing  the  Development  of  Reliable  Software,  Williams,  RD,  Sigplan  Not,  vol 
10.  no  6.  p  3-K.June  1975,  TRW,  Redondo  Beach,  CA 

•  Reliable  Hardware  Software  Architecture.  Wulf,  WA.  Sigplan  Not,  vol  10,  not), 
p  122-130.  June  1975.  (  arnegie-Mellon  University.  Pittsburgh,  PA 
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•  Design  of  Self-Checking  Software,  Yau,  SS,  Cheung,  RC,  Sigplan  Not,  vol  10,  no  6, 
p  450-457,  June  1975,  Northwestern  University,  Evanston 

5.0  Background.  Most  large  computers  have  dedicated  processors  that  continually  test 
the  health  of  the  main  processors.  This  philosophy  may  possibly  be  used  with  LSI  compo¬ 
nents.  If  an  LSI  component  could  signal  its  own  failure,  that  signal  could  become  a  control 
signal  in  a  fault-tolerant  system. 
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REDUNDANT  MICROCOMPUTERS 


1 .0  Introduction  (objective).  The  purpose  of  this  project  is  the  review,  analysis,  and 
development  of  redundant  microcomputer  concepts  for  fault-tolerant  systems  design. 

2.0  Approach.  It  is  necessary  to  formulate  a  baseline  that  will  be  a  structure  for  the 
analysis  and  design  of  redundant  computers.  It  will  be  achieved  through  the  execution  of 
several  concurrent  tasks. 

2.1  Redundant  computer  analysis.  The  previous  use  of  redundancy  with  general-purpose 
computers  has  been  successfully  applied,  and  this  task  will  analyze  advantages,  disadvantages, 
major  design  characteristics,  etc,  of  the  configurations.  The  analysis  must  address  the  major 
characteristics  that  are  applicable  to  use  with  microcomputers. 

2.2  Memory  sharing  redundant  configurations.  Many  applications  require  shared  pro¬ 
grams  that  are  stored  in  a  memory  common  to  many  processors.  This  task  will  analyze  and 
develop  redundancy  concepts  involving  memory  interleaving,  switching  input/output,  and 
hardware  executives/voters.  This  task  should  implement  the  most  unique  concept  and  then 
evaluate  the  results. 

2.3  No-voter  redundant  microcomputers.  Through  a  series  of  bus  data  interchanges,  a 
set  of  microcomputers  can  detect  and  isolate  failed  microcomputers  in  the  configuration. 

This  task  will  analyze  and  develop  protocol,  data  checking,  and  control  procedures  that  allow 
microcomputer  redundant  configurations  without  voting  circuits.  The  analysis  must  address 
reliability  in  quad-redundant  configurations  and  compare  no-voter  designs  against  designs 
which  employ  conventional  voting  circuits. 

2.4  Dedicated  redundant  microcomputers.  Microcomputers  will  be  employed  in  dedi¬ 
cated  configurations,  in  which  dedicated  implies  that  the  microcomputers  implement  an 
algorithm  that  remains  unchanged  throughout  the  life  of  a  system.  This  task  will  explore  the 
use  of  redundant  microcomputers  in  achieving  fault  tolerance  for  dedicated  computer  con¬ 
figurations.  It  will  also  consider  bus  structures,  interfacing,  detection  techniques,  and  other 
design  considerations  in  developing  a  sound  design  methodology. 

2.5  Redundant  microcomputer  methodology.  The  previous  tasks  will  be  assembled  into 
a  structured  design  baseline  for  future  system  design  using  redundant  computers.  It  must 
represent  a  complete  description  of  the  design  methodology  from  operational  requirements 
to  specifications,  design,  and  system  acceptance  test. 

3.0  Scope.  This  is  a  highly  specialized  and  critical  project  because  the  microcomputer  is 
constantly  introducing  new  design  concepts.  Each  of  the  tasks  must  be  addressed  in  a  manner 
that  results  in  new  innovative  concepts. 

4.0  References 

•  The  STAR  (Self  Testing  and  Repairing)  C  omputer:  An  Investigation  of  the  Theory 
and  Practice  of  lault  Tolerant  C  omputer  Design.  A  Avi/.ienis,  GC  Gilley,  KP 
Matluis,  DA  Rennels,  JA  Rohr,  and  DK  Rubin,  IEEE  Transactions  on  Computers, 
vol  C-20,  no  1  I .  November  1 97  1 ,  p  131 2-132 1 
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5.0  Background.  The  advent  of  the  one-chip  microcomputer  has  opened  possibilities  of 
new  approaches  for  redundant  computer  configuration  that  exploit  hardware  executives, 
simplified  voting  circuits,  and  reduced  switch  complexities.  Use  of  redundant  microcompu¬ 
ters  is  now  considered  cost-effective  and  feasible  for  broad  classes  of  problems.  This  project 
will  formalize  the  approaches  to  applying  redundant  computers  from  analysis,  design,  and 
verification. 


LOOSELY  COUPLED  FAULT-TOLERANT  COMPUTER  NETWORKS 


1 .0  Introduction  (objective).  The  objective  of  this  project  is  to  develop  a  theory  that  will 
address  problems  of  loosely  coupled  computer  networks.  Loosely  coupled  computers  are 
those  between  any  two  of  which  there  is  only  one-way  communication.  This  implies  that  the 
receiving  machine  must  always  be  ready  to  receive  data  (otherwise  the  data  will  be  lost)  and 
be  able  to  error  detect  and/or  correct  data. 

2.0  Approach.  By  use  of  directed  graph  theory,  procedures  will  be  developed  to  calculate 
the  fault  tolerance  of  a  loosely  coupled  computer  network. 

3.0  Scope.  This  project  will  use  graph  theory  coding  and  computer  design  specialists. 

4.0  References 

5.0  Background.  Often  a  system  -  for  example,  a  command  control  system  -  has  long 
distances  between  its  various  subsystems.  Because  of  the  long  distances,  the  communication 
paths  have  large  delays  in  them.  If  a  message  is  not  received  correctly,  it  is  impractical  to 
immediately  notify  the  sender  and  retransmit  the  message.  Also,  there  is  no  immediate 
feedback  to  the  sender  for  verifying  the  message  was  sent  correctly.  This  constraint  on  a 
system  presents  problems  that  may  change  system  design  in  order  to  make  it  fault-tolerant. 
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FAULT  TOLERANCE  MASKING  HAZARDS 


1 .0  Introduction  (objective).  The  purpose  of  this  project  is  to  analyze  and  develop  tech¬ 
niques  for  addressing  fault  tolerance  systems  design  and  use  of  redundancy.  The  goal  of  this 
project  will  be  to  investigate  contemporary  problems  associated  with  redundancy,  in  particu¬ 
lar  those  associated  with  masking  (or  static  redundancy),  and  develop  a  set  of  design  guidelines 
to  help  system  designers  avoid  problems  in  designing,  using  static  redundancy. 

2.0  Approach.  This  project  will  consist  of  the  following  tasks: 

a.  Develop  a  listing  of  design  features  for  system  designs  that  use  static  redundancy 
to  achieve  fault  tolerance. 

b.  Using  the  list  of  design  features  as  a  reference,  develop  a  set  of  design  goals  for  a 
fault-tolerant  system  that  will  use  static  redundancy. 

c.  From  the  design  goals,  prepare  a  document  that  gives  the  design  information  for 
meeting  design  goals. 

3.0  Scope.  The  scope  of  this  project  is  to  take  the  theory  given  in  the  references  and 
then  adapt  it  to  information  that  can  be  used  by  a  system  designer.  The  personnel  perform¬ 
ing  this  project  require  both  theoretical  and  practical  systems  design  backgrounds. 

4.0  References 

•  Fault-Tolerant  Computing:  An  Introduction,  A  Avizienis,  UCLA,  Los  Angeles, 
January  22,  1977.  Work  supported  by  NSF  Grant  no  MCS  72-03633  A04 

•  The  Attainment  of  Reliable  Digital  Systems  Through  the  Use  of  Redundancy  -  A 
Survey,  RA  Short,  IEEE  Computer  Group  News,  vol  2,  no  2,  p  2-17,  March  1968 

•  Mathematical  Theory  of  Reliability,  RW  Barlow  and  F  Proschan,  New  York,  Wiley 
and  Sons,  1965 

•  Probabilistic  Logics  and  Synthesis  of  Reliable  Organisms  from  Unreliable  Compo¬ 
nents,  J  Von  Neuman,  CE  Shannon,  and  J  McCarthy,  ed.  Annuals  of  Math  Studies 
No  34,  Princeton  University  Press.  1956,  p  43-98 

•  Real  Time  Fault  Detection  for  Small  Computers,  JR  Allen  and  SS  Yau,  AFIPS 
Conference  Proc,  vol  40,  1972 

5.0  Background.  In  a  fault  susceptible  system,  fault  tolerance  can  be  implemented 
through  the  use  of  protective  redundancy  that  will  functionally  take  action  when  a  specified 
fault  occurs. 

One  method  of  fault  tolerance  redundancy  that  has  been  employed  is  fault -masking 
by  using  redundancy  in  such  a  fashion  that  the  fault  will  be  completely  contained  within  a 
system  module.  As  long  as  the  built-in  redundancy  has  not  been  exhausted,  no  symptoms 
will  appear  outside  the  module.  However,  when  the  redundancy  has  been  exhausted  or  over¬ 
whelmed  by  a  fault  module,  failure  will  result.  Separate  detection  and  recovery  functions 
are  not  identified  from  outside  the  module.  Thus,  masking  is  often  called  a  static  redundancy 
technique. 

The  use  of  static  hardware  redundancy  is  based  on  the  assumption  that  failures  in  the 
redundant  copies  are  independent.  For  this  reason  the  use  of  static  redundancy  is  difficult 
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to  justify  within  integrated  circuit  packages,  due  to  the  likelihood  of  failures  affecting  several 
adjacent  circuits  or  components. 

The  forms  of  static  redundancy  that  have  been  applied  to  space  program  computers 
have  been  the  application  of  replication  of  individual  electronic  components  and  Triple  Mod¬ 
ular  Redundancy  (TMR)  with  voting.  Several  other  applications  have  been  studied,  but  never 
implemented  because  of  the  excessive  cost  of  special  components  that  were  beyond  the  state 
of  the  art. 

Some  disadvantages  of  fault  masking  redundancy  are: 

a.  Excessive  costs  associated  with  massive  replication  (three,  four,  or  more  times  the 
basic  system  elements). 

b.  The  requirement  that  assumes  independent  failures  of  the  replicas. 

c.  The  absence  of  warning  when  a  module  exhausts  its  redundancy  and  finally  will 

fail. 

Fault  masking  is  similar  to  fault  avoidance.  While  it  may  postpone  the  time  of  failure, 
the  module  will  still  fail  suddenly  and  irrevocably  when  its  redundancy  is  exhausted. 

Regardless  of  the  disadvantages,  masking  is  a  popular  approach  because  of  conceptual 
simplicity,  instant  reaction  to  a  fault,  and  the  fact  that  it  is  entirely  transparent  to  the  user, 
in  particular,  a  promising  area  of  application  is  protecting  a  “hard  core”  for  which  other 
approaches  are  prohibitively  costly. 
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FAULT  TOLERANCE  DESIGN  HANDBOOK 


1 .0  Introduction  (objective).  The  summation  of  technological  data  on  fault  tolerance 
technology  should  be  contained  in  a  fault  tolerance  design  handbook,  and  this  task  will 
develop  the  handbook  to  assume  the  widespread  dissemination  of  data. 

2.0  Approach.  This  program  is  a  continuous  collection  and  assessment  of  technology 
maintained  by  the  military  for  use  by  industry  in  system  design. 

3.0  Scope.  This  project  is  conducted  to  achieve  a  handbook  useful  to  both  the  system 
engineer  and  designer. 

4.0  References 


•  The  Design  of  Totally  Self-Checking  Systems,  David  Su  Ming  Ho,  Coordinated 
Science  Laboratory,  Report  R723,  April  1976,  UILU-ENG7 6-221 1 ,  University 
of  Illinois 

•  General  Design  Rules  for  the  Construction  of  m-out-of-n  Totally  Self-Checking 
Checkers,  James  E  Smith  and  Gemot  Metze,  Coordinated  Science  Laboratory 
Report  R-693,  University  of  Illinois  UIL  -  ENG  -  75-2228,  October  1975 


5.0  Background 


DESIGN/DEVELOPMENT  TOOLS  (METHODOLOGY) 


1.0  Introduction  (objective).  Design  and  development  of  complex  fault-tolerant  systems 
involve  many  procedures,  techniques,  and  methods  that  can  be  automated.  In  particular, 
tools  can  be  automated  for  generation  of  tests,  documentation,  circuit  simulation,  and  other 
functions.  This  project  will  address  the  development  of  the  design  and  development  method¬ 
ology  and  the  tools  required  to  support  it. 

2.0  Approach.  Many  automated  design  and  development  tools  will  be  included  in  this 
project.  The  key  task  is  the  development  of  an  overall  design  and  development  methodology. 
This  task  must  be  performed  in  consonance  with  other  tasks  outside  the  area  of  fault  toler¬ 
ance  technology. 

3.0  Scope.  This  is  a  project  which  should  naturally  be  included  within  the  realm  of  design 
automation  and  methodology. 

4.0  References 

5.0  Background.  Presently  there  are  no  methods  (or  tools)  available  to  the  system 
designer  for  developing  fault  tolerance  system  design.  There  is  a  need  to  assemble  procedures, 
techniques,  and  methods  to  assist  the  system  designer. 
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ACCEPTANCE  TESTING 


1 .0  Introduction  (objective).  The  goal  of  this  project  is  to  develop  acceptance  testing 
cedures  for  fault-tolerant  systems.  These  procedures  need  to  address  testing  for  transient 
faults,  permanent  faults,  and  masked  faults. 

2.0  Approach.  The  following  tasks  are  to  be  accomplished  for  this  project. 


2.1  Review  of  standards.  Review  existing  military  standards  on  acceptance  testing  and 
list  those  tests  that  may  change  when  applied  to  testing  a  fault-tolerant  system. 

2.2  Document  changes.  Write  recommended  changes  to  the  military  standards  including 
the  rationale  for  making  the  changes. 

3.0  Scope.  This  project  will  be  involved  with  all  types  of  acceptance  testing  and  contract 
conditions. 

4.0  References 

•  MIL-STD-781D,  Reliability  Tests  Exponential  Distribution,  1 5  November  1967 

•  Project  Engineering,  Hajek,  VG,  McGraw-Hill  Book  Co,  New  York,  1965 

5.0  Background.  Fault-tolerant  systems  continue  to  operate  normally  even  though  there 
has  been  an  internal  failure.  Therefore,  the  normal  methods  of  testing  will  not  detect  this 
type  of  failure;  however,  the  system  is  still  degraded.  For  example,  when  one  of  two  redun¬ 
dant  paths  has  failed,  the  next  failure  will  shut  down  the  system.  Because  of  this,  it  will  take 
special  testing  procedures  to  meet  the  fault-tolerant  specifications. 


TECHEVAL  -  OPEVAL  TECHNIQUES 


1 .0  Introduction  (objective).  The  technical  and  operational  evaluation  of  fault-tolerant 
capabilities  is  critical  to  the  development  of  new  systems.  Yet,  the  test  and  evaluation  is 
difficult  to  accomplish  after  system  design.  The  purpose  of  this  project  is  (a)  to  assess  the 
effectiveness  of  T&E  after-design,  and  (b)  to  indicate  techniques  to  perform  more  effective 
fault  tolerance  systems  test  and  evaluation. 

2.0  Approach.  In  concert  with  recognized  test  and  evaluation  agencies,  this  task  will  pro 
vide  an  assessment  of  techniques  for  fault  tolerance  test  and  evaluation.  It  will  recommend 
new  techniques  for  improvement  of  test  and  evaluation  for  fault-tolerant  systems. 

3.0  Scope.  This  project  should  be  concerned  with  the  additional  complexity  of  systems 
that  results  from  fault  tolerance  design. 


References 


5.0  Background 


SECTION  Bl.  CURRENT  RESEARCH 


Title 

Responsible  organization 

Performing  organization 
Contraet/grant 

Principal  investigator 
Period  of  performance 
Objective 

Approach 


Highly  Reliable  Civil  Aircraft  Computer  Technology 

NASA,  Aeronautics  and  Space  Technical  Office,  Langley 
Research  Center 

NASA,  Langley  Research  Center,  Hampton,  VA 

505-07-3:7670104 

505-07-31:7570105 

501-23-31:7470028 

Stitt,  JE,  Graves,  GB 

7/75  to  6/76;  7/74-6/75;  7/73-6/74 

Initiate  the  logic  design  of  two  advanced  fault-tolerant 
computer  architectural  concepts 

Formal  proofs  of  design  will  be  developed  to  prove  cor¬ 
rect  fault  recovery  strategies.  Procedures  for  obtaining 
data  inputs  for  reliability  assessment  tools  will  be  devel¬ 
oped,  and  reliability  assessments  will  be  performed  for  the 
fault-tolerant  computer  system  designs.  Software  faults 
and  their  impact  on  systems  reliability  will  be  investigated. 
In-house  investigations  of  off-the-shelf  computer  systems 
will  be  performed  to  gather  data  for  determining  that 
reliability  improvements  and  fault  tolerance  are  reliable. 
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Distributed  Computing  for  Real-Time  Information 
Processing 


Title 

Responsible  organization 

Responsible  individual 
Performing  organization 
Contract/grant 
Principal  investigator 
Period  of  performance 
Objective 

Approach 


Naval  Air  Systems  Command.  Air  d(>0.  Comm.  Control 
and  Guidance.  Washington.  DC 

Zempolich,  BA 

NELC,  San  Diego.  CA 

In-house 

Wong,  I  IF 

5/76-9/78 

Quantify  the  All  Applications  Digital  Computer  (AADC) 
parameters  required  for  a  low-cost,  fault-tolerant,  real¬ 
time  tactical  system. 

Based  on  work  performed  for  the  office  of  Naval  Research 
by  UCLA  and  NELC  on  distributed  functions  architecture 
and  other  related  recent  work,  develop  a  distributed  archi¬ 
tecture  model  for  a  real-time  tactical  system  such  as 
NAVSEA’s  Surface  Offensive  Warfare  Command  Control 
System.  Exercise  the  model  and  determine  the  parameters 
for  processor  partitioning,  module  interconnection,  and 
intersystem  interfaces. 


Title- 


Adaptive  Engagement  Logic  and  Control  Studies 


Responsible  organization  DA  BMD  Advanced  Technology  Center.  ATTN:  ATC-C. 

PO  Box  1500,  Huntsville,  AL 

Responsible  individual  Smith,  TO 

Performing  organization  TRW,  Inc.,  TRW  Systems  Group,  One  Space  Park, 

Redondo  Beach.  CA 

Systems  Control.  Inc.  1801  Page  Mill  Rd.  Palo  Alto,  CA 

C'ontract/grant  DASG60-7  6-C-0084 

Principal  investigator  Spadaro,  FG  (TRW) 

Spain,  DS  (Sys  Contr,  Inc) 

Period  of  performance  4/76-1  1  /7  5 

Objective  The  objective  of  this  program  was  to  formulate  and 

develop  advanced  resource  allocation  and  control  tech¬ 
niques  which  will  produce  orders  of  magnitude  of 
defense  performance  improvement.  These  advanced 
techniques  should  provide  a  wide  range  of  optimal 
defense  response  to  a  broad  spectrum  of  innovative 
offensive  strategies  and  must  consider  projected  tech¬ 
nology  advances  in  associated  disciplines,  such  as  sensors 
and  interceptors.  This  effort  extends  the  state  of  the  art 
in  BMD. 

Approach  The  approach  that  was  used  to  accomplish  the  program 

objectives  included:  ( 1 )  conducting  a  requirements 
analysis  study  and  defining  resource  allocation  and  con¬ 
trol  algorithm  design  requirements  for  both  ground-based 
and  missileborne  applications;  (2)  conducting  a  survey  of 
advanced  information  processing  techniques  to  include 
pattern  recognition,  fault-tolerant  logic,  artificial  intelli¬ 
gence,  etc:  (3)  performing  an  adaptive  engagement  logic 
and  control  algorithm  design  study;  and  (4)  producing  a 
simulation  (test  bed)  design  that  contains  design  features 
sufficient  to  demonstrate  the  validity  of  the  approach  for 
accomplishing  the  objectives  of  the  program. 

Progress  Information  processing  techniques  with  potential  applica¬ 

bility  to  BMD  control  problems  were  identified,  and 
examples  of  their  use  in  a  BMD  context  were  provided. 
Results  are  documented  in  the  final  report  entitled  Adap¬ 
tive  Engagement  Logic  and  Control  Studies  dated 
November  1075. 
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Title 


Fault-Tolerant  Computing 


Responsible  organization  US  National  Science  Foundation,  Division  of  Mathe¬ 

matics  and  Computer  Sciences 

Performing  organization  University  of  California,  School  of  Engineering  and 

Applied  Sciences,  Computer  Sciences,  Los  Angeles,  CA 

Contract/gran t  MC572-03633,  A00  through  A04 

Principal  investigators  Avizienis,  AA;  Chu,  WW;  Martin,  DF 

Period  of  performance  Fourth  Year  2/76-1/76 

Fifth  and  final  year  3/76-4/77 

Objective  A  5-year  continuing  research  effort  focused  on  fault 

tolerance  of  computer  systems. 

Approach  This  project  is  a  continuing  research  effort  focused  on 

fault  tolerance  of  computer  systems.  A  fault-tolerant 
computer  executes  its  entire  set  of  programs  correctly  in 
the  presence  of  faults  in  the  computer  system.  Faults 
which  occur  in  a  computer  system  fall  into  two  major 
categories:  hardware  faults,  which  include  all  deviations 
from  design-specified  values  of  logic  variables  within  the 
hardware  of  the  computer:  and  software  faults,  which 
include  all  deviations  from  correct  program  execution 
due  to  errors  occurring  during  the  translation  of  the 
original  specification  of  an  algorithm  to  the  program 
being  executed. 

The  primary  topics  for  study  during  the  fourth  year  were 
Methodology  of  Fault  Tolerance,  Design  and  Modeling 
of  Fault-Tolerant  Systems,  Multiaccess  Memories  and 
Distributed  Data  Base  Systems,  Program  Correctness,  and 
Software  Reliability  and  Certification. 

The  primary  topics  for  study  during  the  fifth  and  final 
year  of  the  grant  included  Methodology  of  Fault- 
Tolerance,  Design  and  Modeling  of  Fault-Tolerant  Com¬ 
puter  Systems,  and  Software  Reliability. 


Responsible  organization 


Performing  organization 


Contract/grant 
Principal  investigator 
Period  of  performance 
Objective 


Approach 


Organization  and  Technology  —  Oriented  Codes  for 
Arithmetic  and  Memory  Systems 

US  National  Science  Foundation,  Division  of  Engineering 

Southern  Methodist  University,  School  of  Engineering, 
Electrical  Engineering,  Dallas.  TX 

ENG76-1 1237 

Rao.TR 

5/76-/77 

Investigate  codes  and  coding  techniques  for  fourth- 
generation  computer  arithmetic  and  memory  systems. 

The  study  is  directed  towards  investigating  codes  and 
coding  techniques  for  fourth-generation  computer  arith¬ 
metic  and  memory  systems.  Specifically,  the  techniques 
are  to  be  technology-oriented  and  organization-oriented 
for  the  purpose  of  fault  tolerance,  improved  performance, 
and  reliability  of  operation.  In  addition,  coding  tech¬ 
niques  which  improve  the  yield  of  LSI  memory  chips 
and  thus  lower  the  cost  per  chip  by  a  significant  factor 
are  studied.  An  important  and  challenging  problem 
posed  here  is  to  determine  the  best  (from  the  cost- 
performance  point  of  view)  approaches  for  “highly  sur- 
vivable  memory”  organizations  and  “byte-organized 
arithmetic  processors.”  The  most  significant  gain  to  be 
expected  from  coding  is  that  with  a  given  level  of  semi¬ 
conductor  technology,  it  will  be  possible  to  produce 
coded  memory  arrays  of  much  larger  size  than  would  be 
economically  feasible  in  the  uncoded  case. 
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Responsib le  orga niza t ion 


Performing  organization 
Contract/grant 
Principal  investigator 
Period  of  performance 


Objective 


Approach 


Information  Management  System 

NASA,  Manned  Space  Flight  Office,  Marshall  Space 
Flight  Center,  Huntsville,  AL 

NASA,  Marshall  Space  Flight  Center,  Huntsville,  AL 

910:7670706 

Panciera,  RE 

7/75-6/76 

The  objective  of  the  Space  Ultrareliable  Modular  Com¬ 
puter  effort  was  to  develop  low-cost,  reliable  aerospace 
modular  computers  and  computer  input/output  hard¬ 
ware  which  are  applicable  to  NASA’s  future  payloads. 

The  main  thrust  of  this  effort  was  to  inject,  into  the 
current  simplex  design,  adequate  reliability  by  the  incor¬ 
poration  of  automated  fault  detection  and  correction 
designs;  perform  qualification  testing;  and  continue  to 
develop  support  software.  The  development  of  an 
internally  fault-tolerant  computer  must  be  undertaken 
for  timely  integration  of  future  payload  information 
management  subsystems  such  as  the  tug.  The  objective 
of  the  information  management  effort  was  to  continue 
to  define,  develop,  and  space  qualify  an  on-board  Infor¬ 
mation  Management  System  (IMS)  which  can  best 
accomplish  the  requirements  of  the  space  tug  missions. 
Areas  in  which  analysis  and  technology  developments  are 
required  will  be  identified.  The  requirements  for  mem¬ 
ory  will  be  determined.  A  computer  previously  developed 
under  this  effort  and  computer  input/output  hardware 
with  special  monitoring  equipment  currently  under  devel¬ 
opment  will  be  provided  for  use  in  the  design,  develop¬ 
ment,  and  test  of  a  redundant  laser  gyro  inertial  measur¬ 
ing  unit. 
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Title 


Symposium  on  Computer  Software  Engineering 


Responsible  organization 

Responsible  individual 
Performing  organization 
Contract/grant 
Principal  investigator 
Period  of  performance 
Objective 


Air  Force  Office  of  Scientific  Research,  NE,  Bolling  AFB. 
Washington,  DC 

Knausenberger,  OF 

Polytechnic  Institute  of  New  York,  Brooklyn,  NY 
AF-AF-SR-289 1  -75 
Oliner,  AA 
6/75-6/77 

The  twenty-fourth  symposium  of  the  Polytechnic  Institute 
of  New  York  scheduled  for  20-22  April  1976  on  the 
subject  of  Computer  Software  Engineering  focuses  atten¬ 
tion  on  new  fields  of  electronics  research.  Computer 
systems  development  depends  significantly  on  improve¬ 
ments  in  planning,  specification,  design,  acceptance,  and 
deployment  of  software.  Its  relationships  to  computer 
operational  reliability  and  availability  must  be  defined, 
measured,  and  predicted.  Related  software  research  ren¬ 
ders  suitable  probabilistic  models;  also,  design  techniques 
such  as  structured  programming,  team  programming,  and 
code  reading  promise  more  reliable  initial  designs.  Broad 
fields  of  testing  and  analysis  of  programs  and  of  compu¬ 
tational  operations  and  computer  element  utilization 
begin  to  emerge.  The  planned  symposium  will  provide 
an  international  forum  for  the  leaders  in  these  areas  to 
present  their  developing  ideas,  to  interact,  and  (via  the 
proceedings)  to  provide  lasting  documentation.  The  pro¬ 
gram  being  considered  includes:  in  the  area  of  reliability, 
topics  such  as  program  proofs  and  test  models,  reliability 
estimation  and  demonstration,  operational  reliability 
measurement,  quality  control  measures,  software  main¬ 
tenance  models,  and  fault-tolerant  software;  in  the  area 
of  software  management,  prerequisites  and  tools  to  aid 
program  construction,  configuration  control,  productivity, 
and  planning  anil  scheduling  of  programming;  in  the  area 
of  design,  problems  of  measures  ol  program  complexity 
and  information  processing  flow,  of  analysis  of  run  time, 
calculation  of  memory  size,  and  hardware/software 
tra feoff.  The  symposium  is  sponsored  by  the  Air  Force 
Office  of  Scientific  Research.  Army  Research  Office, 
and  the  Office  of  Naval  Research  under  the  Joint  Services 
Electronics  Program  (JSEP)  in  cooperation  with  the 
Polytechnic  Institute  of  NY.  It  has  been  assured  of  the 


technical  support  of  the  IEEE  Computer  Society;  also, 
the  participation  of  the  IEEE  Group  on  Reliability  and 
of  the  Association  for  Computing  Machinery  is  being 
arranged. 

30  June  75  -  30  June  76.  The  symposium  on  computer 
software  engineering  took  place  April  20  to  22  in  New 
York  with  an  attendance  of  about  250  professionals  and 
with  sessions  on  design  techniques  (three  papers), 
requirements  (five  papers),  models  and  data  (six  papers), 
theory  ( five  papers),  languages  (four  papers),  reliability 
(seven  papers),  and  case  studies  (five  papers).  Preparation 
of  proceedings  is  in  progress. 


Title 


Theory  of  Fault  Tolerance 


Responsible  organization 
Responsible  individual 
Performing  organization 

Contract/grant 
Principal  investigator 
Period  of  performance 
Objective 


Approach 


Progress 


Office  of  Naval  Research,  Arlington,  VA 
Denicoff,  MD 

Honeywell,  Inc,  Systems  Research  Division, 

Minneapolis,  MN 

N000 14-75-C-OOl  1 

Heimerdinger,  WL 

8/74-1  1/76 

As  more  and  more  functions  are  given  to  computers,  the 
need  of  the  Navy  for  ultrareliable  computers  grows  more 
acute.  The  general  area  of  fault  tolerance  is  of  great 
importance.  To  date,  no  unified  theory  of  fault  tolerance 
exists  that  enables  one  to  evaluate  fault  tolerance  schemes, 
to  make  comparisons,  or  to  determine  design  improve¬ 
ments  for  such  devices.  This  work  was  aimed  at  the 
development  of  a  systematic  analytic  framework  to 
accomplish  these  important  tasks. 

Several  graph  modeling  techniques  have  been  identified 
as  possible  tools  to  use  for  this  development.  Specific- 
tasks  for  this  contract  period  are  (1)  to  investigate  the 
critical  properties  of  data  which  must  be  incorporated 
into  the  graphical  representation,  in  order  to  correctly 
model  data  contamination;  (2)  to  clarify  the  distinction 
between  data  and  control  and  formalize  their  interaction 
in  the  context  of  graphical  techniques;  and  (3)  to  investi¬ 
gate  the  concept  of  hierarchy  in  labeled  graphs. 

Progress  to  date  has  shown  a  systematic  and  mechanizable 
technique  for  reducing  a  Petri  net  to  a  smaller  Petri  net, 
such  that  the  smaller  net  preserves  some  properties  of  the 
original.  The  same  systematic  method  can  also  verify  that 
a  Petri  net  possesses  those  desired  properties.  In  order  to 
reduce  them  systematically,  il  is  necessary  to  identify 
some  desirable  properties  of  the  nets  and  to  determine 
mechanically  whether  a  net  possesses  them.  The  desirable 
properties  are  primarily  consistency  and  invariance  and 
t Heir  relations  with  liveness,  boundedness,  and  transition 
tiring  ratios.  Refer  to  Heimerdinger,  WL.  and  Jack.  LA. 

A  (iraph  Theoretic  Approach  to  Fault-Tolerant  Comput¬ 
ing,  1975-76  Annual  Report,  22  March  1976. 


Title 


Galois  Logic  for  Fault  Tolerance  Detection  in  Computer 
Systems 


Responsible  organization  Office  of  Naval  Research,  Arlington,  VA 

Responsible  individual  Denicoff,  MD 

Performing  organization  Sperry  Rand  Corp,  UNIVAC  Defense  Systems  Division, 

St  Paul,  MN 

Contract/grant  N000 1 4-7  5-C -0675 

Principal  investigator  Marver,  JM 

Period  of  performance  7/74-6/76 

Objective  There  is  a  growing  need  in  the  Navy  for  more  reliable 

computer  systems.  The  general  area  of  fault  tolerance 
is  of  great  importance  in  the  effort  to  produce  computing 
equipment  with  very  high  reliability.  The  mathematics 
of  Galois  logic  is  uniquely  suited  to  systems  which  can  be 
easily  reconfigured  to  bypass  certain  components,  thus 
tolerating  faults  in  the  system.  This  effort  investigates 
Galois  logic  to  apply  its  unique  properties  in  the  design 
of  highly  reliable  computer  hardware. 

Approach  The  contractor  will  investigate  several  configurations  of 

Galois  linear  modules;  several  different  error  classes  such 
as  multiple  bit  errors  in  the  same  variable;  and  the  feasi¬ 
bility  of  a  particular  method  using  Galois  test  polynomials 
for  error  detection,  location,  and  suppression.  The 
method,  if  feasible,  will  be  generalized  to  larger  arrays 
of  Galois  linear  modules. 

Progress  Fault  detection  and  location  algorithms  have  been  devel¬ 

oped  for  simple  trees  of  Galois  linear  modules.  They 
cover  the  4-bit  case  and  have  been  extended  to  the  8-bit 
and  the  16-bit  cases.  However,  much  additional  work 
remains  to  be  done. 
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Title 

Responsible  organization 
Responsible  individual 
Performing  organization 

Contract/grant 
Principal  investigator 
Period  of  performance 
Objective 

Approach 


Distributed  information  Processing  Systems  Architecture 
Office  of  Naval  Research,  Arlington,  VA 
Trimble,  J 

NELC,  Communications  Processing  Division,  San  Diego, 
CA 

In-house 

Wong,  HF  (Code  3200) 

10/74-9/79 

Investigate  the  problem  of  control  for  a  distributed  infor¬ 
mation  processing  architecture,  using  minicomputers  and 
microprocessors  that  can  readily  adapt  to  and  recover 
from  processor  or  bus/link  failure. 

Study  distributed  processing  architecture.  Study  and 
analyze  minicomputer  and  microprocessor  interconnec¬ 
tion  techniques.  Examine  relevant  data  management  and 
retrieval  strategies.  Analyze  the  asynchronous  bus  struc¬ 
ture  for  processing  element  interconnection.  Examine 
alternatives  for  an  optimal  distributed  information  system 
architecture  relationship  to  the  fault  tolerance  of  the 
architecture. 


Progress 


A  basic  system  bus  structure  has  been  described  and 
allocation  protocols  have  been  defined  for  it. 


Title 


Command  Control:  Fault-Tolerant  Input/Output 
Network 


Responsible  organization 
Responsible  individual 
Performing  organization 
Contract/grant 


Office  of  Naval  Research,  Arlington,  VA 
Denicoff,  MD 

Charles  Stark  Draper  Laboratory,  Cambridge,  MA 

N000 14-7  6-C-0502 
N000 14-7  5-C -0500 


Principal  investigator 
Period  of  performance 
Objective 


Approach 


McKenna, JF 
7/74-6/76;  3/76-2/77 

The  Navy  has  a  need  for  highly  reliable  computer  systems. 
The  development  of  fault-tolerant  theory  and  schemes 
for  implementation  is  one  important  part  in  the  develop¬ 
ment  of  ultrareliable  systems.  One  key  subsystem  in  any 
computer  system  is  the  input/output  network.  This  task 
is  directed  toward  the  development  of  such  a  fault- 
tolerant  subsystem. 

A  fault-tolerant  network  is  being  designed  and  fabricated 
to  serve  as  the  I/O  network  for  a  multiprocessor  system. 
The  I/O  network  is  an  alternative  to  I/O  buses.  Nodes 
of  the  network  are  used  to  route  information.  Configura¬ 
tion  control  is  directed  by  a  centralized  processor.  Con¬ 
trol  algorithms  are  implemented  in  software.  This  task 
involves  evaluation  of  this  network  concept  and  develop¬ 
ment  and  evaluation  of  the  necessary  software.  Such  a 
network  should  provide  both  an  economically  viable 
alternative  to  a  data  bus  and  a  significant  improvement 
in  reliability  and  survivability,  particularly  to  physical 
damage. 
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Title 


Distributed  Processor/Memory  Architectures 


Responsible  organization 

Responsible  individual 
Performing  organization 
Contract/grant 
Principal  investigator 
Period  of  performance 
Objective 

Approach 


Air  Force  Avionics  Laboratory  AAM,  Wright  Patterson 
AFB,  OH 


Texas  Instruments  Inc,  Dallas,  TX 
DF096630:F3361 5-74-C-1018 
Consolver,  G 
7/74-6/75 

Obtain  detailed  functional  design  from  the  distributed 
processor  memory  computer  concept  definition  including 
fault-tolerant  capability,  modular  expandability,  and 
basic  LSI  compatible  iterative  structure. 

Obtain  executive  control  and  application  program  base 
to  support  the  system.  Simulate  the  DP/M  network  at  a 
functional  level  to  verify  the  concept.  Implement  the 
flow  chart  design  of  the  executive  control  program. 

Design  an  automated  “Process  Constructor”  procedure 
for  allocating  applications  programs  to  DP/M  processors. 
Begin  design  of  an  HOL  compiler  compatible  with  the 
DP/M  architecture. 


Title 


Command  Control:  Advanced  Approaches  to  Computer 
System  and  Component  Design 


Responsible  organization 
Responsible  individual 
Performing  organization 

Contract/grant 
Principal  investigator 
Period  of  performance 
Objectives 

Approach 


Office  of  Naval  Research,  430C,  Arlington,  VA 


University  of  California,  School  of  Engineering,  Los 
Angeles,  CA 

DN723247:N00014-69-A-0200-4027 

Ohu,  W 

7/74-6/75 

Continue  research  in  the  areas  of  computer  architecture, 
fault-tolerant  features  of  computer  communications,  and 
memory  systems,  as  these  subjects  are  important  in  Navy 
combat  systems. 

Demonstrate  feasibility  of  new  concepts  with  experimen¬ 
tal  hardware  to  make  empirical  measurements  and  to 
make  comparisons  with  theoretically  predicted  results. 

In  order  to  thoroughly  test  experimental  devices,  simu¬ 
lators  have  been  built  and  monitor  programs  written. 


Title 


Adaptime  Information  Processing  and  Control  Studies 


Responsible  organization 

Responsible  individual 
Performing  organization 

Contract/grant 
Principal  investigator 
Period  of  performance 
Objective 

Approach 


Progress 


Department  of  the  Army,  Advanced  Technology  Center, 
Attn:  ATC-C,  Huntsville,  AL 

Tubbs,  L 

Leland  Stanford  Jr  University,  c/o  sponsored  projects 
office,  Stanford,  CA 

OASG60-75-C-0095 

Widrow,  B 

6/75-11/76 

Perform  basic  research  in  the  information  sciences  and 
demonstrate  how  advanced  information  processing  tech¬ 
niques  might  be  applied  to  solving  practical  engineering 
problems  encountered  in  developing  real-time  algorithms 
for  advanced  sensors  and  data  processing  systems.  This 
effort  will  extend  the  state  of  the  art  in  BMD. 

The  approach  includes  identifying  and  analyzing  practical 
engineering  problems  encountered  in  developing  real¬ 
time  data  processing  techniques  for  information  process¬ 
ing  and  control  of  advanced  passive  and  active  sensors. 
Consideration  shall  be  given  to  sensor  applications  per¬ 
forming  such  functions  as  acquisition,  track,  object  iden¬ 
tification,  and  hand-over  to  other  sensors.  Information 
processing  techniques  to  be  considered  will  include  artifi¬ 
cial  intelligence,  pattern  recognition,  fault-tolerant  logic, 
and  other  techniques  that  may  be  defined  in  the  course 
of  this  study. 

Progress  includes  defining  the  resource  allocation  problem 
in  a  game  theoretic  sense  for  botli  a  terminal  and  area 
defense  system  configuration.  Adaptive  learning  tech¬ 
niques  are  being  applied  to  the  solution  of  this  problem 
and  results  will  be  compared  to  alternative  solution  tech¬ 
niques.  Heuristic  search  and  production  rule  concepts 
have  been  identified  as  candidates  for  detailed 
investigation. 


Title 


A  Graph  Theoretic  Approach  to  Fault-Tolerant 
Computing 


Responsible  organization 

Responsible  individual 
Performing  organization 
Contract/grant 
Principal  investigator 
Period  of  performance 
Objective 


Approach 


Air  Force  Office  of  Scientific  Research,  Bolling  AFB. 
Washington.  DC 

Bram,  Joseph 

Honeywell,  Inc,  Minneapolis,  MN 
F44620-75-C-0053 
Heimerdinger,  W 
2/75-1/77 

Avionics  systems  are  becoming  the  heart  of  almost  all 
modern  Air  Force  weapon  systems.  The  growing  central 
importance  of  the  avionics  system  to  overall  system  oper¬ 
ation  has  highlighted  the  need  for  a  systematic  approach 
to  avionics  system  design  from  a  fault  tolerance  viewpoint. 
The  objectives  of  this  research  program  are  to  expand 
existing  fault-tolerant  digital  systems  and  to  develop  a 
better  representation  of  data  in  these  systems.  This 
research  will  contribute  to  the  development  of  unified 
theory  of  fault  tolerance  for  analysis  and  synthesis. 

The  influence  of  time  on  the  functional  behavior  of  digital 
systems  and  on  fault  detection  and  fault  recovery  in  them 
will  be  investigated  and  a  formulation  developed  to  reflect 
this  influence  in  the  fault  tolerance  graph  model.  This 
would  include  determining  when  time  is  an  important 
factor,  documenting  the  assumption  implicit  in  the  incor¬ 
poration  of  time  in  the  model,  developing  the  expanded 
model,  and  refining  and  evaluating  the  new  model.  In 
other  work,  the  development  of  a  representation  for  this 
data  in  fault-tolerant  systems  will  be  initiated. 


Title 

Responsible  organization 

Performing  organization 
Contract/grant 
Principal  investigator 
Period  of  performance 
Objective 


Approach 


Progress 


Maintainability  Prediction  and  Analysis 

Air  Force,  Rome  Air  Development  Center  RBR,  Griffiss 
AFB,  New  York  13440 

Hughes  Aircraft  Company,  Fullerton,  C'A 

DF736770 

Pliska,  T 

7/75-6/76 

Develop  new  maintainability  prediction  and  analysis 
techniques  based  directly  on  the  characteristics  of  equip¬ 
ment  diagnostics  and  using  built-in-reliability  data  and 
time  standards. 

Involved  study  and  further  development  of  the  diagnostic, 
built-in,  and  external  test  equipment  figures  of  merit  such 
as  those  developed  during  this  in-house  phase  of  study. 
Definition  of  quantifiable  means  to  evaluate  or  measure 
each  such  Figure  of  merit,  followed  by  the  integration  of 
such  figures  of  merit  with  other  directly  quantifiable 
components  of  maintainability  to  form  modeling  and 
prediction  relationships.  In  particular,  the  prediction 
technique  involved  relating  each  diagnostic  routine  (test 
circuit  procedure,  ie,  automatic,  semiautomatic,  or  man¬ 
ual)  to  the  LRUs  involved  and  determination  of  the  pro¬ 
portion  of  faults  possible  in  an  LBU  that  will  be  detected 
by  each  diagnostic  routine.  Determining  the  proportion 
of  failures  expected  from  each  LRU  of  an  equipment. 
Defining  the  time  required  to  isolate  a  failure  to  an  LRU 
if  a  diagnostic  (automatic)  does  not.  Defining  diagnostic 
ambiguities  (diagnostic  isolation  to  how  many  different 
LRUs)  and  the  average  time  to  isolate  the  faulted  LRU 
under  that  circumstance.  Time  motion  studies  to  define 
corrective  action  times.  All  these  parameters  are  available 
from  a  design  study  of  A/N  equipment. 

This  model  will  reflect  three  aspects  of  data  handling: 
data  access,  data  attributes,  and  data  tra"  sb>rmation. 

1  February  75  31  January  7 6.  The  aim  ot  this  research 

is  to  develop  a  model,  based  on  the  theory  of  gi.iphs.  for 
the  representation  and  analysis  of  fault-tolerant  systems. 
Such  a  model  could  be  of  immense  value  in  the  design  o: 
future  electronic  hardware,  and  possibly  in  computer 
software.  It  was  determined  that  die  notion  of  IVtri  nets. 


B-l  8 


which  are  extensions  of  labeled  graphs,  would  yield  the 
most  useful  types  of  models  for  fault-tolerant  systems. 
The  fault  phenomena  commonly  seen  in  digital  systems 
were  classified  according  to  their  observable  effects,  and 
it  was  found  that  six  well  defined  functional  fault  classes 
were  enough  to  encompass  most  of  the  faults  that  ever 
arise.  Furthermore,  the  classification  scheme  can  also  be 
applied  usefully  to  software.  Some  problems  with  the 
approach  involve  faults  with  data  items,  faults  involving 
timing  constraints,  and  faults  moving  in  time  into  differ¬ 
ent  classes.  The  efforts  of  the  past  year  concentrated  on 
control-related  fault  phenomena. 


Title 


Large  Scale  Computer  Systems 


Responsible  organization  US  National  Science  Foundation,  Division  of  Mathematics 

and  Computer  Sciences 

Performing  organization  University  of  California,  School  of  Engineering,  Electrical 

Engineering  and  Computer  Sciences,  Berkeley,  CA 

Contract/grant  DCR72-03734-A01 

Principal  investigator  Ramamoorthy,  CV 

Period  of  performance  3/75-2/76;  3/76-2/77 

Objective  Study  large-scale  computing  systems  with  primary  empha¬ 

sis  on  pipeline  processing. 

Approach  Consideration  will  be  given  to  parallel  processing  and  the 

fault  tolerance  and  reliability  of  these  approaches. 

Included  will  be  studies  of  the  detection  of  parallelism 
in  existing  serial  programs,  schedule  and  utilization  of 
parallel  and  pipeline  resources,  the  detection  of  program 
characteristics  (including  parallelism)  which  have  a  marked 
effect  on  the  utilization  of  a  pipeline  system  organization, 
and  parallel  and  pipelined  fault  diagnosis  and  recovery. 
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FAULT-TOLERANT  SVSTEHS  TECHNOLOGY  PROGRAM  PLAN:  H 
CONCENTRATED  PRIORITIZE.  .  <U>  NAVAL  OCEAN  SYSTEMS  CENTER 
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Title 


Reliable  Switching  Circuits  for  Naval  Communications 


Responsible  organization 
Responsible  individual 
Performing  organization 

Contract /grant 
Principal  investigator 
Period  of  performance 
Objective 

Approach 


Progress 


Office  of  Naval  Research,  Arlington,  VA 
Denicoff,  M 

Johns  Hopkins  University,  Dept  of  Electrical  Engineering, 
Baltimore.  MD 

N00014-75-C-1 196 

Masson,  C.M 

6/75-9/76 

Navy  communications  systems,  both  shipboard  and  shore- 
based,  use  very  large  switching  networks.  Presently, 
problems  associated  with  these  networks  include  high 
cost,  individual  switch  characteristics,  and  network  relia¬ 
bility.  This  task  is  aimed  at  studying  the  problems  of 
reliability  and  its  associated  cost. 

The  contractor  will  perform  research  on  the  fault-tolerant 
aspects  of  large-scale  switching  networks.  The  continued 
development  of  general  results  and  analytical  tools  will 
be  stressed.  The  three  areas  of  concentration  for  the 
study  are:  ( 1 )  fault  detection,  location,  and  recovery: 

(2)  measures  of  fault  tolerance  within  switching  networks; 
and  (3)  development  of  simulation  techniques  for  large- 
scale  fault  conditions. 

While  this  effort  has  been  underway  only  a  short  time, 
some  progress  has  already  been  made.  As  the  underlying 
basis  of  the  interconnecting  capability  of  binomial  switch¬ 
ing  and  linking  systems,  a  theorem  has  been  proved  which 
specifies  the  formation  of  a  family  of  subsets  of  a  given 
set  of  elements  for  which  any  subset  of  this  given  set  con¬ 
taining  less  than  or  equal  to  some  fixed  number  of  ele¬ 
ments  is  a  partial  transversal. 


Title 


Design  of  Radiation-Hardened  Digital  Circuits  and 
Systems 


Responsible  organization 

Responsible  individual 
Performing  organization 
Contract/grant 
Principal  investigator 
Objective 

Approach 


Progress 


AF  Weapons  Laboratory  ELP,  Kirtland  AFB, 
Albuquerque,  NM 

Simon,  Robert  G 

University  of  New  Mexico,  Albuquerque,  NM 

F29601-75-C-004I 

Devries,  Ronald  C 

To  analyze  the  effectiveness  of  two  fault  tolerance  tech¬ 
niques  and  to  develop  other  circuit  hardening  schemes 
for  producing  radiation-hardened  large  scale  integrated 
(LSI)  circuits  for  use  in  military  applications. 

The  effort  is  divided  into  three  phases.  In  the  first  phase 
UNM  proposed  techniques  will  be  analyzed  and  com¬ 
pared  with  respect  to  effectiveness  with  other  fault  toler¬ 
ance  and  error  correction  schemes.  In  the  second  phase, 
radiation-tolerant  circuits  will  be  designed  via  the  most 
effective  combination  of  these  techniques.  In  the  third 
phase,  the  circuits  designed  in  the  second  phase  will  be 
analyzed  by  the  contractor  for  electrical  and  radiation 
performance. 

1  November  75  —  1  May  76.  Data  on  distribution  of 
failures  due  to  radiation  of  assorted  semiconductor 
devices  are  being  assembled.  The  available  data  do  not 
include  as  many  devices  as  had  been  hoped,  but  will 
allow  comparative  failure  analyses  as  intended.  Flash 
X-ray  tests  of  two  sample  circuits  were  completed  and 
analysis  of  results  is  proceeding. 


Title 


Advanced  Software  Quality  Assurance 


Responsible  organization 

Responsible  individual 
Performing  organization 
Contract/grant 
Principal  investigator 
Period  of  performance 
Objective 

Approach 


Army  DA  BMD,  Advanced  Technology  Center, 
Kluntoville,  AL 

Johnson,  BA 

General  Research  Corp,  Santa  Barbara,  CA 

DASG60-76-C-0050 

Milton,  R 

3/76-5/77 

Develop  techniques  that  extend  the  state  of  the  art  in 
BMD  by  generating  and  proper  insertion  of  assertions 
into  a  software  package  to  support  automated  program 
correctness  proofs. 

Identify  faults  in  support  of  fault-tolerant  capabilities. 
They  are  developing  techniques  necessary  to  develop  the 
assertions  related  to  software  reliability. 


SECTION  B2.  ANNOTATED  BIBLIOGRAPHY 

Avizicnis,  A.  Gilley,  GC,  et  at.  The  STAR  (Self-Testing  and  Repair)  Computer:  An  Investiga¬ 
tion  of  the  Theory  and  Practice  of  Fault-Tolerant  Computer  Design,  IEEE  Transactions  on 
Computers,  vol  C-20,  no  1 1 ,  p  1 3 1 2  -  1321,  November  1 97 1 

Abstract 

This  paper  reports  the  results  obtained  in  initial  studies  which  led  to  the  design  and 
construction  of  a  computer  with  dynamic  redundancy  -  called  STAR.  Several  aspects  of 
this  computer  (which  went  into  operation  in  1969)  are  discussed.  Areas  covered  include 
architecture,  reliability  analysis,  software,  automatic  maintenance  of  peripheral  systems,  and 
adaptation  to  serve  as  the  central  computer  in  an  outcrplanet  exploration  spacecraft.  Tech¬ 
niques  of  fault  tolerance  went  into  the  systematic  design  of  STAR  and  were  later  applied  to 
automatic  maintenance  of  the  Thermoelectric  Outer  Planet  Spacecraft  (TOPS). 

Bedrosian,  SD,  Ho,  DeYuan,  Research  on  Fault  Analysis  of  Analog  Circuits,  interim  report. 
Office  of  Naval  Research,  Alexandria,  VA,  August  1976,  contract  N00014-75-C-0768, 
University  of  Pennsylvania 

Abstract 

Work  involves  development  of  inexpensive  alternative  approaches  for  fault  analysis  of 
analog  circuits.  The  authors  show  a  basis  for  the  application  of  fuzzy  set  concepts  and  a  new 
fuzzy  fault  dictionary  approach.  They  want  to  apply  adaptive  learning  techniques  to  augment 
fuzzy  algorithms  and  make  possible  an  “on  line”  procedure  for  fault  isolation  and  location. 

Chiang,  ACG,  Test  Schemes  for  Microprocessor  Chips,  Computer  Design,  p.  87-92,  April 
1975 

Abstract 

Paper  reviews  modern  microprocessor  unit  testing  techniques  in  four  considerations: 
computer  simulation,  signature  testing,  pattern  recognition,  and  pattern  generation.  Advan¬ 
tages  and  disadvantages  of  the  prevailing  techniques  are  discussed  with  emphasis  on  the 
characterization-oriented  techniques  which  fulfill  most  design  needs.  Author  believes  that 
end-users  will  adopt  the  techniques  of  pattern  recognition  and  pattern  gei  eration  because  of 
their  low  cost. 

Collins.  EJ,  Fairtest  -  A  System  for  Computer  Aided  Design  of  Test  Programs  for  Digital 
Logic  Modules.  Fairchild  Systems  Technology,  Palo  Alto.  CA  94304.  no  date,  etc 

Abstract 

Reports  the  development  and  use  of  the  Fairtest  computer  aided  test  generation  sys¬ 
tem  tailored  specifically  for  functional  test  pattern  generation  for  In-ital  logic  modulcv  The 
paper  describes  the  general  nature  and  capability  of  the  modules  winch  make  up  the  system, 
plus  examples  of  application  to  a  macro  model  of  JK  flip-flop  and  to  a  sample  network  which 
uses  that  macro. 


Cottier,  O,  Testing  of  Complex  Digital  Multi-LSI  Array  Assemblies,  RCA 


Abstract 

Appearance  of  high-complexity  digital  PCBs  with  10-20  interconnected  LSI  chips 
and  several  MSI  and  SSI  digital  devices  has  caused  the  surfacing  of  a  problem.  That  problem 
is  the  lack  of  proved  cost-effective  means  to  generate  test  programs  for  these  boards. 

RCA  is  looking  at  establishing  test  design  criteria  and  surveying  existing  test  design 
methods.  Their  approach  is  to  consider  testing  very  early  in  the  generation  of  complex 
printed  circuit  boards. 

Gates,  JS,  Electro-hydraulic  Diagnostic  Equipment,  Chrysler  Corp 

Abstract 

Reports  on  the  development  of  new  fault-isolation  methods  at  the  system  level.  The 
objective  is  to  provide  a  low-cost  Automated  Universal  Hydraulic  Test  Console  configured  for 
electro-hydraulic  diagnostics  and  fault  isolation.  Work  is  a  systematic  approach  to  testing, 
validating,  designing,  and  developing  a  hydraulic  test  console.  Tradeoff  studies  were  con¬ 
ducted  to  determine  the  effectiveness  of  a  computer  approach  vs  a  semiautomated  approach 
(switching  logic)  to  test  circuits. 

Goodman,  DM,  Automatic  Test  Equipment  for  High  Power  Microwave  Tubes  and  Systems, 
Davex  Engineering  Co,  Report  100/3,  San  Diego 

Abstract 

The  purpose  of  the  task  was  to  review  the  reliability  and  maintainability  aspects  of 
high-power  microwave  tubes  and  transmitters;  to  determine  what  sensors  and  data  processors 
could  be  used,  to  improve  the  automatic  fault  diagnostic  capability  in  these  tubes  and  trans¬ 
mitters;  and  to  make  plans  for  carrying  out  the  recommendations  resulting  from  the  study. 
One  of  the  recommendations  was  that  the  Navy  initiate  an  R&D  program  titled  ATE  for  High 
Power  Signal  and  Control  Electronics.  The  system  reported  upon  was  the  AN/AWG-10  air¬ 
borne  fire  control  system  utilized  in  the  F-I4  aircraft. 

Hayes.  JP.  A  Graph  Model  for  Fault-Tolerant  Computing  Systems,  IEEE  Trans  Comput,  vol 
C-25.  no  9,  September  1976 


Abstract 

Paper  presents  a  graph  model  (related  to  Hamiltonian  graph  theory)  which  explicitly 
represents  computing  systems  and  the  algorithms  to  be  executed.  The  graphs  display  the 
computing  facilities  involved  in  particular  computations  as  well  as  the  interconnections  among 
them.  The  model  allows  for  nonprobabilistic  measures  of  fault  tolerance  according  to 
Avizienis’  definition  of  fault  tolerance  “the  ability  to  execute  specified  algorithms  correct¬ 
ly,  regardless  of  hardware  failures  and  program  errors."  The  model  directly  relates  to  the 
problem  of  determining  minimum  configurations  or  structure  required  to  achieve  a  certain 
degree  of  fault  tolerance. 


Hayes,  JP,  On  the  Properties  of  Irredundant  Logie  Networks.  IEEE  Transactions  on  Com¬ 
puters,  vol  C-25,  no  9,  September  1976,  USC 

Abstract 

This  work  investigated  the  constraints  imposed  on  the  structure  of  combinational 
logic  networks  by  various  types  of  irredundancy.  These  were  classified  as:  (1 )  irredundancy, 
(2)  redundancy,  and  (3)  irredundancy  which  relates  to  certain  types  of  network  structures. 
The  results  presented  have  relevance  to  the  problem  of  determining  whether  a  given  network 
contains  redundancy. 

Jack  LA,  Theory  of  Fault  Tolerance,  an  Examination  of  a  Design  for  Testability  Methodolo¬ 
gy,  vol  II,  7  December  1976 

Abstract 

When  compared  to  the  significant  costs  of  hardware  procurement,  the  cost  of  system 
testing  and  support  was  considered  to  be  secondary,  but  this  emphasis  has  changed  because 
of  increasing  system  complexity.  Design  for  testability  has  slanted  towards  a  top-down 
methodological  approach,  which  is  what  Jack  talks  about:  “design  more  testable  systems 
with  minimal  cost  impact.”  The  primary  purpose  of  the  paper  was  to  outline  the  components 
of  such  a  methodology. 

Jack,  LA,  Heimerdinger,  WL,  Han,  YW,  Kinney,  LL,  Theory  of  Fault  Tolerance,  1976  Annual 
Report,  vol  1,7  December  1976 

Abstract 

The  authors  explore  labeled  graph  properties  to  successfully  model  fault-tolerant 
phenomena.  A  modeling  hierarchy  task  demonstrated  that  Petri  net  structures  could  be  re¬ 
duced  while  preserving  their  intrinsic  properties.  Also  investigated  was  a  design  for  a  testabil¬ 
ity  methodology  which  was  identified  as  a  Navy-wide  requirement.  Another  task  was  to 
formalize  the  interface  between  the  data  transformation  structure  and  the  control  structure. 
This  work  indicated  that  a  two-graph  modeling  approach  was  preferred  over  the  single-graph 
approach. 

Jack,  LA,  Heimerdinger.  WL,  Johnson.  MD.  Theory  of  Fault-Tolerance.  1974  -  1975  Annual 
Report.  22  September  1975 

Abstract 

This  report  details  the  effort:,  involved  in  representing  fault-tolerant  phenomena  with 
labeled  graphs.  To  provide  greater  flexibility  over  the  STUCK-AT-ON1  or  STUCK- AT-/ I  RQ 
models,  two  existing  classical  models  were  used  to  develop  a  theoretical  base  for  design  and 
evaluation  of  fault-tolerant  digital  system  >.  These  existing  labeled  graphs  are  LOCOS  and 
Petri  nets.  Results  indicated  that  use  of  these  models  concisely  described  the  in.ncacies  of 
fault  mechanisms. 
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Lovera.  RAP.  On  Detection-Estimation  Schemes  for  Signals  with  Uncertain  Models.  Report 
R-736.  thesis.  July  ll>70.  University  of  Illinois.  Urhana.  II. 

Abstract 

Paper  presents  a  new  approach  for  estimating  a  random  signal  whose  statistical  descrip¬ 
tion  contains  unknown  parameters.  The  joint  detection-estimation  scheme  consists  of  a  bank 
of  estimates  and  a  detector  to  select  the  most  appropriate  estimator  for  each  realization  of 
the  signal.  This  is  illustrated  by  examples,  and  a  general  explicit  solution  for  a  continuous¬ 
time  Gaussian  is  presented. 

McCluskey,  EJ,  Wakcrly,  John  F,  Ogus,  RC,  Current  Research,  Technical  Report  100,  October 
1975,  Center  for  Reliable  Computing,  Stanford  University 

Abstract 

This  technical  report  summarizes  the  results  of  three  projects  related  to  computer 
reliability  and  fault  tolerance.  The  first  project  concerns  the  theory  of  faults  in  logic  systems 
and  centers  around  the  role  of  redundancy  in  the  design  of  reliability  models.  This  first 
project  is  very  extensively  annotated.  The  second  study  reported  upon  deals  with  the  main¬ 
tainability  of  computers,  while  the  third  project  lends  itself  to  the  area  of  dual  computer  con¬ 
figurations  as  applied  to  guidance  and  navigation  systems.  The  report  lists  many  works  in  the 
field  of  computer  reliability  and  fault  tolerance. 

Smith,  JE,  The  Design  of  Totally  Self-Checking  Combinational  Circuits,  Report  R-737,  thesis, 
August  1976,  University  of  Illinois,  Urbana,  IL 

Abstract 

Paper  discusses  the  design  of  self-checking  combinational  circuits.  Emphasis  is  given 
to  the  design  of  “totally  self-checking  check  circuits,”  since  a  totally  sell-checking  circuit  is 
of  little  value  if  it  itself  cannot  be  checked.  Fault  models  are  discussed.  Circuit  designs  which 
are  totally  self-checking  with  respect  to  unidirectional  and  single  stuek-at  faults  are  considered. 
Also,  other  sets  of  faults  such  as  shorts,  transients,  and  intermittent  failures  are  discussed. 

Smith.  JE,  Metze,  G,  On  the  Existence  of  Combinational  Networks  with  Arbitrary  Multiple 
Redundancies.  Report  R-692.  October  1975.  University  of  Illinois,  Urbana.  IL 

Abstract 

Tlie  report  presents  a  few  tools  (math  models,  etc)  for  studying  multiple-line  redun¬ 
dancies  lor  which  any  proper  subset  is  irredundant.  Also  explored  are  several  new  examples 
of  multiple  redundancies,  and  it  is  proved  by  a  constructive  method  that  redundancies  of  all 
multiplicities  exist.  The  authors  state,  however,  that  unintentional  redundancy  increases  the 
costs  of  combinational  logic  networks  and  makes  it  difficult  or  impossible  to  diagnose  all 
faults  in  the  network. 
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Smith,  JE,  Metze,  G,  General  Design  Rules  for  the  Construction  of  m-out-of-n  Totally  Self- 
Checking  Checkers,  Report  R-693,  October  1975,  University  of  Illinois,  Urbana,  1L 

Abstract 

This  report  presents  a  set  of  conditions  that  characterize  a  class  of  realizations  of 
m-out-of-n  code  totally  self-checking  checkers.  A  design  method  for  minimal  two-level  check¬ 
ers  is  given.  A  relationship  between  m  and  n  is  shown  using  a  theorem  from  combinatorics. 
Paper  goes  on  to  report  the  achievement  of  fault-tolerant  systems  through  self-checking. 
Properties  of  totally  self-checking  checkers  are  given  by  defining  circuits  in  terms  of  self¬ 
testing,  fault-secure,  totally  self-checking,  code  disjoint,  and  totally  self-checking  checker. 

Susskine,  AK,  Diagnostics  for  Logic  Networks,  IEEE  Spectrum,  p  40-47,  October  1973 

Abstract 

Paper  centers  about  the  problem  of  testing  large  network  arrays  with  a  substantial 
lack  of  test  points.  An  examination  of  functional  and  structural  testing  is  made  with  most  of 
the  paper  zoning  in  on  structural  testing  techniques.  Fault  insertion  vs  fault  simulation  is  dis¬ 
cussed  as  is  the  difference  between  the  popular  D-algorithm  and  the  Boolean-difference 
approach.  The  author  favors  the  incorporation  of  diagnostics  into  design  rules,  and  criticizes 
the  development  of  diagnostics  after  the  design  has  been  completed. 

NOTE:  In  the  3  years  since  this  paper,  the  classical  methods  of  test  generation  (D-algorithm 
and  Boolean-difference)  have  been  made  useless  by  the  doubtfulness  of  stuck-at  models,  com¬ 
plexity  of  chips,  and  unavailability  of  gate  models  for  LSI  units. 

Yau,  SS,  Cheung,  RC,  Design  of  Self-Checking  Software.  Proceedings  of  1975  International 
Conference  on  Reliable  Software 

Abstract 

This  particular  paper  gives  visibility  to  different  techniques  for  constructing  a  piece  of 
self-checking  software  for  systems  requiring  ultrareliability.  Various  self-checking  capabilities 
(checks  of  functions,  software  errors,  incorrect  loop  terminations,  control  sequence  data  for  a 
process,  illegal  branches,  etc)  can  be  implemented  during  the  initial  stage  of  program  develop¬ 
ment.  The  paper  evaluates  the  cost-effectiveness  of  each  technique  in  the  particular  operating 
environment. 
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